Class KoreanTokenizer

All Implemented Interfaces:
Closeable, AutoCloseable

public final class KoreanTokenizer extends Tokenizer
Tokenizer for Korean that uses morphological analysis.

This tokenizer sets a number of additional attributes:

  • PartOfSpeechAttribute containing part-of-speech.
  • ReadingAttribute containing reading.

This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters.
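To illustrate what a least-cost segmentation search does, here is a minimal, self-contained sketch in plain Java. It is not Lucene's implementation: the class name, the `wordCosts` lexicon, and the `unknownCost` penalty are hypothetical, the sketch ignores the connection costs between adjacent tokens, and it processes the whole input at once rather than in a rolling window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Lucene API.
final class ViterbiSegmentationSketch {

    /**
     * Returns the least-cost segmentation of text.
     *
     * wordCosts   - assumed lexicon mapping a surface form to its cost
     * unknownCost - assumed cost for a single character missing from the lexicon
     */
    static List<String> segment(String text, Map<String, Integer> wordCosts, int unknownCost) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: minimal cost of segmenting text[0, i)
        int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        // Forward pass: extend every reachable position with each candidate token.
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // no path reaches this start offset
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown spans are only allowed as single characters
                    }
                    cost = unknownCost;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Backward pass: follow the back pointers from the end of the text.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> lexicon = Map.of("한국", 2, "한국어", 3, "어", 4, "문법", 2);
        System.out.println(segment("한국어문법", lexicon, 10)); // prints [한국어, 문법]
    }
}
```

In this toy lexicon the path "한국어" + "문법" costs 5, which beats "한국" + "어" + "문법" at 8, so the search picks the longer first token.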

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Field Details

  • Constructor Details

    • KoreanTokenizer

      public KoreanTokenizer()
      Creates a new KoreanTokenizer with default parameters.

      Uses the default AttributeFactory.

    • KoreanTokenizer

      public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
      Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
      Parameters:
      factory - the AttributeFactory to use
      userDictionary - optional user dictionary; may be null.
      mode - the decompound mode.
      outputUnknownUnigrams - if true, outputs unigrams for unknown words.
    • KoreanTokenizer

      public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
      Creates a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
      Parameters:
      factory - the AttributeFactory to use
      userDictionary - optional user dictionary; may be null.
      mode - the decompound mode.
      outputUnknownUnigrams - if true, outputs unigrams for unknown words.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
    • KoreanTokenizer

      public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
      Creates a new KoreanTokenizer with a custom system dictionary and unknown dictionary. This constructor is an entry point for users who want to construct custom language models that can be used as input to DictionaryBuilder.
      Parameters:
      factory - the AttributeFactory to use
      systemDictionary - a custom known token dictionary
      unkDictionary - a custom unknown token dictionary
      connectionCosts - custom token transition costs
      userDictionary - optional user dictionary; may be null.
      mode - the decompound mode.
      outputUnknownUnigrams - if true, outputs unigrams for unknown words.
      discardPunctuation - true if punctuation tokens should be dropped from the output.
      WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Method Details