Class KoreanTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class KoreanTokenizer
    extends Tokenizer
    Tokenizer for Korean that uses morphological analysis.

    This tokenizer sets a number of additional attributes:

      • PartOfSpeechAttribute containing part-of-speech.
      • ReadingAttribute containing reading.

    This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
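
    The idea of a least-cost segmentation can be illustrated with a small, self-contained sketch. It is not Lucene's implementation: the lexicon, the costs, and the class name LeastCostSegmenter are hypothetical, and it leaves out both connection costs between adjacent tokens and the rolling (incremental) processing of the input. It only shows how dynamic programming picks the cheapest path through a lattice of candidate words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy illustration only; not part of Lucene. All names and costs are assumptions.
final class LeastCostSegmenter {
  // Hypothetical lexicon: surface form -> word cost (lower is preferred).
  static final Map<String, Integer> WORD_COST =
      Map.of("a", 5, "b", 5, "c", 5, "ab", 3, "bc", 2, "abc", 9);
  // Cost charged for a single character that is not in the lexicon.
  static final int UNKNOWN_COST = 20;
  // Longest lexicon entry, used to bound the inner loop.
  static final int MAX_WORD_LENGTH = 3;

  // Returns the segmentation of text with the lowest total word cost.
  // Positions are UTF-16 char offsets, which is enough for this ASCII toy.
  static List<String> segment(String text) {
    int n = text.length();
    long[] best = new long[n + 1]; // best[i] = cheapest cost to segment text[0, i)
    int[] prev = new int[n + 1];   // prev[i] = start offset of the last word on that path
    Arrays.fill(best, Long.MAX_VALUE);
    best[0] = 0;

    for (int end = 1; end <= n; end++) {
      for (int start = Math.max(0, end - MAX_WORD_LENGTH); start < end; start++) {
        if (best[start] == Long.MAX_VALUE) {
          continue; // no path reaches this start offset
        }
        Integer cost = WORD_COST.get(text.substring(start, end));
        if (cost == null) {
          if (end - start > 1) {
            continue; // unknown words are only allowed as single characters
          }
          cost = UNKNOWN_COST;
        }
        long total = best[start] + cost;
        if (total < best[end]) {
          best[end] = total;
          prev[end] = start;
        }
      }
    }

    // Walk the back pointers from the end of the input to recover the path.
    List<String> tokens = new ArrayList<>();
    for (int end = n; end > 0; end = prev[end]) {
      tokens.add(text.substring(prev[end], end));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  public static void main(String[] args) {
    // "a" + "bc" costs 7, which beats "ab" + "c" (8) and "abc" (9).
    System.out.println(segment("abc")); // prints [a, bc]
  }
}
```

    In the real tokenizer the path cost also depends on the transition between neighboring tokens, which is what the connectionCosts constructor parameter supplies.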

    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Constructor Detail

      • KoreanTokenizer

        public KoreanTokenizer()
        Creates a new KoreanTokenizer with default parameters.

        Uses the default AttributeFactory.

      • KoreanTokenizer

        public KoreanTokenizer(AttributeFactory factory,
                               UserDictionary userDictionary,
                               KoreanTokenizer.DecompoundMode mode,
                               boolean outputUnknownUnigrams)
        Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
        Parameters:
        factory - the AttributeFactory to use
        userDictionary - optional user dictionary; may be null.
        mode - the decompound mode.
        outputUnknownUnigrams - if true, outputs unigrams for unknown words.
      • KoreanTokenizer

        public KoreanTokenizer(AttributeFactory factory,
                               UserDictionary userDictionary,
                               KoreanTokenizer.DecompoundMode mode,
                               boolean outputUnknownUnigrams,
                               boolean discardPunctuation)
        Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
        Parameters:
        factory - the AttributeFactory to use
        userDictionary - optional user dictionary; may be null.
        mode - the decompound mode.
        outputUnknownUnigrams - if true, outputs unigrams for unknown words.
        discardPunctuation - if true, punctuation tokens are dropped from the output.
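
        A minimal usage sketch for this constructor follows. It assumes the Lucene Nori analysis module is on the classpath; the class name KoreanTokenizerUsage and the helper tokenize are hypothetical. TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY and the DecompoundMode.DISCARD constant come from the wider Lucene API rather than from this page, so check them against the version in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class; requires the Lucene Nori module on the classpath.
final class KoreanTokenizerUsage {

  // Tokenizes text and returns each term together with its left part-of-speech tag.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    try (KoreanTokenizer tokenizer =
        new KoreanTokenizer(
            TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, // default AttributeFactory
            null,                                        // no user dictionary
            KoreanTokenizer.DecompoundMode.DISCARD,      // decompound mode
            false,                                       // outputUnknownUnigrams
            true)) {                                     // discardPunctuation
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);

      // Standard TokenStream workflow: setReader, reset, incrementToken, end, close.
      tokenizer.setReader(new StringReader(text));
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        tokens.add(term.toString() + "/" + pos.getLeftPOS());
      }
      tokenizer.end();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String[] args) {
    tokenize("한국어 형태소 분석").forEach(System.out::println);
  }
}
```

        Passing false for discardPunctuation instead keeps punctuation tokens in the stream, which can matter when a downstream filter needs to see them.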
      • KoreanTokenizer

        public KoreanTokenizer(AttributeFactory factory,
                               TokenInfoDictionary systemDictionary,
                               UnknownDictionary unkDictionary,
                               ConnectionCosts connectionCosts,
                               UserDictionary userDictionary,
                               KoreanTokenizer.DecompoundMode mode,
                               boolean outputUnknownUnigrams,
                               boolean discardPunctuation)
        Creates a new KoreanTokenizer with a custom system dictionary and unknown dictionary. This constructor provides an entry point for users who want to construct custom language models that can be used as input to DictionaryBuilder.
        Parameters:
        factory - the AttributeFactory to use
        systemDictionary - a custom known token dictionary
        unkDictionary - a custom unknown token dictionary
        connectionCosts - custom token transition costs
        userDictionary - optional user dictionary; may be null.
        mode - the decompound mode.
        outputUnknownUnigrams - if true, outputs unigrams for unknown words.
        discardPunctuation - if true, punctuation tokens are dropped from the output.
        WARNING: This API is experimental and might change in incompatible ways in the next release.