Class JapaneseTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class JapaneseTokenizer
    extends Tokenizer
    Tokenizer for Japanese that uses morphological analysis.

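    This tokenizer sets a number of additional attributes:

      • BaseFormAttribute containing base form for inflected adjectives and verbs.
      • PartOfSpeechAttribute containing part-of-speech.
      • ReadingAttribute containing reading and pronunciation.
      • InflectionAttribute containing additional part-of-speech information for inflected forms.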

    This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. Tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise) are checked for a second-best segmentation after penalties are applied to long tokens. If one exists and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
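
    The least-cost search and the long-token penalty described above can be illustrated with a toy sketch. This is not Lucene's implementation: the class LeastCostSegmenter, the lexicon, the word costs, and the linear per-character penalty are all hypothetical, and connection costs between adjacent tokens are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmenter; all costs and names here are illustrative assumptions. */
class LeastCostSegmenter {
    // Hypothetical lexicon: surface form -> word cost (lower is better).
    static final Map<String, Integer> LEXICON = Map.of(
        "関西", 3,
        "国際", 3,
        "空港", 3,
        "関西国際空港", 7);

    /**
     * Finds the cheapest segmentation of text by dynamic programming over end positions.
     * A token longer than threshold pays penaltyPerExtraChar for each extra character;
     * pass 0 to disable the penalty.
     */
    static List<String> segment(String text, int threshold, int penaltyPerExtraChar) {
        int n = text.length();
        int[] best = new int[n + 1];   // best[i] = lowest cost of any path ending at i
        int[] back = new int[n + 1];   // back[i] = start of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) continue;  // start is unreachable
                Integer cost = LEXICON.get(text.substring(start, end));
                if (cost == null) continue;                      // not a known word
                int len = end - start;
                int penalty = len > threshold ? (len - threshold) * penaltyPerExtraChar : 0;
                int total = best[start] + cost + penalty;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }
        if (best[n] == Integer.MAX_VALUE) return List.of();      // no segmentation found
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "関西国際空港";
        System.out.println(segment(text, 2, 0));  // penalty disabled
        System.out.println(segment(text, 2, 1));  // long tokens penalized
    }
}
```

    With the penalty disabled, the single compound costs 7 and beats the three-part path at 9. With a penalty of 1 per character beyond length 2, the compound costs 11 and the three-part path wins, which is the kind of alternate segmentation that SEARCH mode outputs in addition to the compound.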

    • Constructor Detail

      • JapaneseTokenizer

        public JapaneseTokenizer(UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer.

        Uses the default AttributeFactory.

        Parameters:
        userDictionary - Optional: if non-null, user dictionary.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        mode - tokenization mode.
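
        A minimal usage sketch for this constructor, assuming the Lucene kuromoji analysis module is on the classpath. The class name KuromojiDemo and the sample text are illustrative only; null is passed for userDictionary because the parameter is documented as optional.

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class KuromojiDemo {
    public static void main(String[] args) throws IOException {
        // No user dictionary, punctuation dropped, SEARCH mode.
        try (JapaneseTokenizer tokenizer =
                 new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader("関西国際空港に行った。"));
            tokenizer.reset();                    // must be called before incrementToken()
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();                      // signals end of stream before close()
        }
    }
}
```

        The reset, incrementToken, end, close sequence follows the general TokenStream consumer workflow; the try-with-resources block handles close() because the tokenizer implements Closeable.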
      • JapaneseTokenizer

        public JapaneseTokenizer(UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 boolean discardCompoundToken,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer.

        Uses the default AttributeFactory.

        Parameters:
        userDictionary - Optional: if non-null, user dictionary.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
        mode - tokenization mode.
      • JapaneseTokenizer

        public JapaneseTokenizer(AttributeFactory factory,
                                 UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
        Parameters:
        factory - the AttributeFactory to use
        userDictionary - Optional: if non-null, user dictionary.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        mode - tokenization mode.
      • JapaneseTokenizer

        public JapaneseTokenizer(AttributeFactory factory,
                                 UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 boolean discardCompoundToken,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer using the system and unknown dictionaries shipped with Lucene.
        Parameters:
        factory - the AttributeFactory to use
        userDictionary - Optional: if non-null, user dictionary.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
        mode - tokenization mode.
      • JapaneseTokenizer

        public JapaneseTokenizer(AttributeFactory factory,
                                 TokenInfoDictionary systemDictionary,
                                 UnknownDictionary unkDictionary,
                                 ConnectionCosts connectionCosts,
                                 UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 boolean discardCompoundToken,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer, supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users who want to construct custom language models that can be used as input to DictionaryBuilder.
        Parameters:
        factory - the AttributeFactory to use
        systemDictionary - a custom known token dictionary
        unkDictionary - a custom unknown token dictionary
        connectionCosts - custom token transition costs
        userDictionary - Optional: if non-null, user dictionary.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        discardCompoundToken - true if compound tokens should be dropped from the output when tokenization mode is not NORMAL.
        mode - tokenization mode.
        WARNING: This API is experimental and might change in incompatible ways in the next release.