Class JapaneseTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

public final class JapaneseTokenizer extends Tokenizer
Tokenizer for Japanese that uses morphological analysis.

This tokenizer sets a number of additional attributes:

  • BaseFormAttribute containing the base form for inflected adjectives and verbs.
  • PartOfSpeechAttribute containing part-of-speech.
  • ReadingAttribute containing reading and pronunciation.
  • InflectionAttribute containing additional inflection information.

This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters. For tokens that appear to be compound (> length 2 for all Kanji, or > length 7 for non-Kanji), we see if there is a 2nd best segmentation of that token after applying penalties to the long tokens. If so, and the Mode is JapaneseTokenizer.Mode.SEARCH, we output the alternate segmentation as well.
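As an illustrative sketch only (not part of this class's specification), the following example constructs the tokenizer with no user dictionary, punctuation discarded, and SEARCH mode — all example choices — and prints each token with its part of speech:

  import java.io.StringReader;

  import org.apache.lucene.analysis.ja.JapaneseTokenizer;
  import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class JapaneseTokenizerDemo {
    public static void main(String[] args) throws Exception {
      // No user dictionary, discard punctuation, SEARCH mode (example configuration).
      try (JapaneseTokenizer tokenizer =
          new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
        tokenizer.setReader(new StringReader("関西国際空港に行きました"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          // In SEARCH mode, a compound such as 関西国際空港 is also emitted as its
          // decompounded segments (関西 / 国際 / 空港).
          System.out.println(term + "\t" + pos.getPartOfSpeech());
        }
        tokenizer.end();
      }
    }
  }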

  • Field Details

  • Constructor Details

    • JapaneseTokenizer

      public JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
      Create a new JapaneseTokenizer.

      Uses the default AttributeFactory.

      Parameters:
      userDictionary - Optional: if non-null, user dictionary.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      mode - tokenization mode.
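      For illustration, a user dictionary might be supplied as in the sketch below; the CSV entry format (surface, space-separated segmentation, readings, part-of-speech label) and the UserDictionary.open(Reader) call reflect the standard Kuromoji user dictionary but should be treated as assumptions of this example.

        import java.io.StringReader;

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.dict.UserDictionary;

        public class UserDictionaryDemo {
          public static void main(String[] args) throws Exception {
            // One custom entry: surface form, segmentation, readings, part-of-speech label.
            String entries =
                "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
            UserDictionary userDict = UserDictionary.open(new StringReader(entries));

            try (JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(userDict, true, JapaneseTokenizer.Mode.NORMAL)) {
              // ... set a reader and consume tokens as usual
            }
          }
        }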
    • JapaneseTokenizer

      public JapaneseTokenizer(UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
      Create a new JapaneseTokenizer.

      Uses the default AttributeFactory.

      Parameters:
      userDictionary - Optional: if non-null, user dictionary.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
      mode - tokenization mode.
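      As a sketch only, the extra flag might be used as follows to emit only the decompounded segments in SEARCH mode, dropping the original compound tokens (the flag values are example choices, not recommendations):

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;

        public class DiscardCompoundDemo {
          public static void main(String[] args) throws Exception {
            // SEARCH mode, punctuation discarded, compound tokens discarded,
            // so only the decompounded segments are emitted (example configuration).
            try (JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(null, true, true, JapaneseTokenizer.Mode.SEARCH)) {
              // ... set a reader and consume tokens as usual
            }
          }
        }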
    • JapaneseTokenizer

      public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
      Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
      Parameters:
      factory - the AttributeFactory to use
      userDictionary - Optional: if non-null, user dictionary.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      mode - tokenization mode.
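      A minimal sketch of this overload, passing the default factory explicitly (which is assumed here to behave the same as the constructor that omits the factory argument):

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.util.AttributeFactory;

        public class FactoryDemo {
          public static void main(String[] args) throws Exception {
            // Explicitly supply an AttributeFactory (the default one, for the example).
            try (JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
                    null, true, JapaneseTokenizer.Mode.SEARCH)) {
              // ... set a reader and consume tokens as usual
            }
          }
        }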
    • JapaneseTokenizer

      public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
      Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
      Parameters:
      factory - the AttributeFactory to use
      userDictionary - Optional: if non-null, user dictionary.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
      mode - tokenization mode.
    • JapaneseTokenizer

      public JapaneseTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, boolean discardPunctuation, boolean discardCompoundToken, JapaneseTokenizer.Mode mode)
      Create a new JapaneseTokenizer, supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
      Parameters:
      factory - the AttributeFactory to use
      systemDictionary - a custom known token dictionary
      unkDictionary - a custom unknown token dictionary
      connectionCosts - custom token transition costs
      userDictionary - Optional: if non-null, user dictionary.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
      mode - tokenization mode.
      WARNING: This API is experimental and might change in incompatible ways in the next release.
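      A heavily hedged wiring sketch follows. For brevity it re-uses the dictionaries bundled with Lucene and assumes they are reachable via getInstance() accessors; a real caller of this constructor would instead pass dictionaries built from a custom language model.

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.dict.ConnectionCosts;
        import org.apache.lucene.analysis.ja.dict.TokenInfoDictionary;
        import org.apache.lucene.analysis.ja.dict.UnknownDictionary;
        import org.apache.lucene.util.AttributeFactory;

        public class CustomDictionaryDemo {
          public static void main(String[] args) throws Exception {
            // Assumption: the bundled system resources are exposed via getInstance().
            // A custom setup would substitute its own dictionaries here.
            TokenInfoDictionary systemDict = TokenInfoDictionary.getInstance();
            UnknownDictionary unkDict = UnknownDictionary.getInstance();
            ConnectionCosts costs = ConnectionCosts.getInstance();

            try (JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(
                    AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
                    systemDict, unkDict, costs,
                    null,   // no user dictionary
                    true,   // discard punctuation
                    false,  // keep compound tokens
                    JapaneseTokenizer.Mode.SEARCH)) {
              // ... set a reader and consume tokens as usual
            }
          }
        }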
  • Method Details