Class JapaneseTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class JapaneseTokenizer
    extends Tokenizer
    Tokenizer for Japanese that uses morphological analysis.

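    This tokenizer sets a number of additional attributes:

      • BaseFormAttribute containing base form for inflected adjectives and verbs.
      • PartOfSpeechAttribute containing part-of-speech.
      • ReadingAttribute containing reading and pronunciation.
      • InflectionAttribute containing additional part-of-speech information for inflected forms.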

    This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after penalties are applied to the long tokens. If one exists and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
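
    The least-cost search and the effect of long-token penalties can be illustrated with a small standalone sketch. This is not Lucene's implementation: it is a plain (non-rolling) dynamic program over a toy dictionary, it ignores connection costs between adjacent tokens, and the class name LeastCostSegmenter, the word costs, the unknown-character cost, and the penalty values are all hypothetical choices made for illustration.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy least-cost segmenter; a sketch of the idea, not Lucene's algorithm. */
    public class LeastCostSegmenter {
        // Hypothetical cost for a single character that is not in the dictionary.
        private static final int UNKNOWN_CHAR_COST = 10_000;

        private final Map<String, Integer> wordCosts;
        private final int maxWordLength;

        public LeastCostSegmenter(Map<String, Integer> wordCosts) {
            this.wordCosts = new HashMap<>(wordCosts);
            int max = 1;
            for (String word : wordCosts.keySet()) {
                max = Math.max(max, word.length());
            }
            this.maxWordLength = max;
        }

        /**
         * Returns the least-cost segmentation of text. Every character beyond
         * lengthThreshold in a token adds penaltyPerExtraChar to that token's cost.
         */
        public List<String> segment(String text, int lengthThreshold, int penaltyPerExtraChar) {
            int n = text.length();
            long[] best = new long[n + 1]; // best[i] = least cost of segmenting text[0, i)
            int[] prev = new int[n + 1];   // prev[i] = start index of the last token on that path
            Arrays.fill(best, Long.MAX_VALUE);
            best[0] = 0;

            for (int end = 1; end <= n; end++) {
                for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                    if (best[start] == Long.MAX_VALUE) {
                        continue;
                    }
                    String candidate = text.substring(start, end);
                    Integer cost = wordCosts.get(candidate);
                    if (cost == null) {
                        if (candidate.length() > 1) {
                            continue; // unknown multi-character spans are not considered
                        }
                        cost = UNKNOWN_CHAR_COST;
                    }
                    long extraChars = Math.max(0, candidate.length() - lengthThreshold);
                    long total = best[start] + cost + extraChars * penaltyPerExtraChar;
                    if (total < best[end]) {
                        best[end] = total;
                        prev[end] = start;
                    }
                }
            }

            // Walk the back pointers from the end of the text to recover the path.
            List<String> tokens = new ArrayList<>();
            for (int end = n; end > 0; end = prev[end]) {
                tokens.add(text.substring(prev[end], end));
            }
            Collections.reverse(tokens);
            return tokens;
        }

        public static void main(String[] args) {
            Map<String, Integer> costs = new HashMap<>();
            costs.put("関西国際空港", 1000);
            costs.put("関西", 500);
            costs.put("国際", 500);
            costs.put("空港", 500);
            LeastCostSegmenter segmenter = new LeastCostSegmenter(costs);

            // No penalty: the single compound token is the cheapest path.
            System.out.println(segmenter.segment("関西国際空港", Integer.MAX_VALUE, 0));
            // Penalizing tokens longer than 2 characters makes the decompounded path cheaper.
            System.out.println(segmenter.segment("関西国際空港", 2, 300));
        }
    }
    ```

    With these toy costs, the unpenalized run keeps the compound as one token (cost 1000 versus 1500 for the three parts), while the penalized run prefers the three parts (1500 versus 2200). The tokenizer described above differs in that, in SEARCH mode, it emits the alternate segmentation in addition to the best one rather than choosing between them.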

    • Constructor Detail

      • JapaneseTokenizer

        public JapaneseTokenizer(UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer.

        Uses the default AttributeFactory.

        Parameters:
        userDictionary - optional user dictionary; may be null.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        mode - tokenization mode.

      • JapaneseTokenizer

        public JapaneseTokenizer(AttributeFactory factory,
                                 UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer.
        Parameters:
        factory - the AttributeFactory to use
        userDictionary - optional user dictionary; may be null.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        mode - tokenization mode.
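
        As a rough usage sketch, the three-argument constructor can be combined with the standard TokenStream consume loop (setReader, reset, incrementToken, end, close). The example assumes the Lucene kuromoji analysis module is on the classpath; the class name KuromojiUsageSketch and the helper method tokenize are hypothetical, and no user dictionary is supplied (null).

        ```java
        import java.io.IOException;
        import java.io.StringReader;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class KuromojiUsageSketch {

            /** Collects the surface forms produced by JapaneseTokenizer for the given text. */
            static List<String> tokenize(String text, JapaneseTokenizer.Mode mode) throws IOException {
                List<String> terms = new ArrayList<>();
                // null: no user dictionary; true: drop punctuation tokens from the output.
                try (JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, true, mode)) {
                    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
                    tokenizer.setReader(new StringReader(text));
                    tokenizer.reset();
                    while (tokenizer.incrementToken()) {
                        terms.add(term.toString());
                    }
                    tokenizer.end();
                }
                return terms;
            }

            public static void main(String[] args) throws IOException {
                System.out.println(tokenize("関西国際空港", JapaneseTokenizer.Mode.SEARCH));
            }
        }
        ```

        The four-argument constructor is used the same way, with an AttributeFactory passed as the first argument. The exact tokens printed depend on the dictionary bundled with the Lucene version in use.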