org.apache.lucene.analysis.ko.KoreanTokenizer

All Implemented Interfaces: Closeable, AutoCloseable

public final class KoreanTokenizer extends Tokenizer

Tokenizer for Korean that uses morphological analysis.

This tokenizer sets a number of additional attributes:

PartOfSpeechAttribute containing part-of-speech.
ReadingAttribute containing reading.

This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
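The idea of a least-cost segmentation can be illustrated with a toy dynamic program. This is only a sketch under simplifying assumptions, not the tokenizer's actual implementation: it ignores connection costs between adjacent tokens, processes the whole input at once rather than in a rolling window, and uses a hypothetical `wordCosts` map and `UNKNOWN_COST` constant in place of the real dictionaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class LeastCostSegmenter {
    // Hypothetical cost for a single character that is not in the dictionary.
    static final int UNKNOWN_COST = 100;

    /** Returns the segmentation of text with the lowest total word cost. */
    static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = least cost to segment text[0, i)
        int[] back = new int[n + 1]; // back[i] = start of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // no path reaches this start position
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown spans are only allowed as single characters
                    }
                    cost = UNKNOWN_COST;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the input to recover the path.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs =
            Map.of("ab", 10, "cd", 10, "abc", 5, "d", 30, "abcd", 25);
        System.out.println(segment("abcd", costs)); // prints [ab, cd]
    }
}
```

Here the path `ab` + `cd` (cost 20) beats both `abcd` (cost 25) and `abc` + `d` (cost 35), so the cheapest path is not necessarily the one with the longest or cheapest single token.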

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static enum

KoreanTokenizer.DecompoundMode

Decompound mode: this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens.

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary

Fields

Modifier and Type

Field

Description

static final KoreanTokenizer.DecompoundMode

DEFAULT_DECOMPOUND

Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).

Fields inherited from class org.apache.lucene.analysis.Tokenizer
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary

Constructors

Constructor

Description

KoreanTokenizer()

Creates a new KoreanTokenizer with default parameters.

KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)

Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary.

KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)

Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.

KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)

Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Method Summary

Modifier and Type

Method

Description

void

close()

void

end()

boolean

incrementToken()

void

reset()

void

setGraphvizFormatter(GraphvizFormatter<KoMorphData> dotOut)

Expert: set this to produce graphviz (dot) output of the Viterbi lattice.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- DEFAULT_DECOMPOUND
  
  public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
  
  Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
Constructor Details
- KoreanTokenizer
  
  public KoreanTokenizer()
  
  Creates a new KoreanTokenizer with default parameters.
  Uses the default AttributeFactory.
- KoreanTokenizer
  
  public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
  
  Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
  
  Parameters:
  
  factory - the AttributeFactory to use
  
  userDictionary - Optional: if non-null, user dictionary.
  
  mode - Decompound mode.
  
  outputUnknownUnigrams - if true, outputs unigrams for unknown words.
- KoreanTokenizer
  
  public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
  
  Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
  
  Parameters:
  
  factory - the AttributeFactory to use
  
  userDictionary - Optional: if non-null, user dictionary.
  
  mode - Decompound mode.
  
  outputUnknownUnigrams - if true, outputs unigrams for unknown words.
  
  discardPunctuation - true if punctuation tokens should be dropped from the output.
- KoreanTokenizer
  
  public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
  
  Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary. This constructor provides an entry point for users that want to construct custom language models that can be used as input to DictionaryBuilder.
  
  Parameters:
  
  factory - the AttributeFactory to use
  
  systemDictionary - a custom known token dictionary
  
  unkDictionary - a custom unknown token dictionary
  
  connectionCosts - custom token transition costs
  
  userDictionary - Optional: if non-null, user dictionary.
  
  mode - Decompound mode.
  
  outputUnknownUnigrams - if true, outputs unigrams for unknown words.
  
  discardPunctuation - true if punctuation tokens should be dropped from the output.
  
  WARNING: This API is experimental and might change in incompatible ways in the next release.
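As an illustration of the five-argument constructor documented above, the following sketch builds a tokenizer with a non-default decompound mode. The class and method names (`KoreanTokenizerConfig`, `mixedModeTokenizer`) are hypothetical, and the choice of `MIXED` is only an example; passing `null` for `userDictionary` relies on the parameter being documented as optional.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ko.KoreanTokenizer;

class KoreanTokenizerConfig {
    /** Builds a tokenizer that uses the dictionaries shipped with Lucene. */
    static KoreanTokenizer mixedModeTokenizer() {
        return new KoreanTokenizer(
            TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, // factory: the default AttributeFactory
            null,                                        // userDictionary: optional, none here
            KoreanTokenizer.DecompoundMode.MIXED,        // mode: example choice, not the default
            false,                                       // outputUnknownUnigrams
            true);                                       // discardPunctuation
    }
}
```

The caller is responsible for closing the returned tokenizer once it is no longer needed.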
Method Details
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Overrides:
  
  close in class Tokenizer
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class Tokenizer
  
  Throws:
  
  IOException
- end
  
  public void end() throws IOException
  
  Overrides:
  
  end in class TokenStream
  
  Throws:
  
  IOException
- incrementToken
  
  public boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- setGraphvizFormatter
  
  public void setGraphvizFormatter(GraphvizFormatter<KoMorphData> dotOut)
  
  Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
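The methods above are normally called in the standard TokenStream consumer order: `setReader`, `reset`, repeated `incrementToken`, `end`, then `close`. The sketch below shows that order with the default constructor and reads the part-of-speech and reading attributes for each token. The class name `KoreanTokenizerDemo`, the `tokenize` helper, and the output string format are assumptions made for illustration; the exact tokens produced depend on the dictionaries shipped with the Lucene version in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.ko.KoreanTokenizer;
import org.apache.lucene.analysis.ko.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.ko.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class KoreanTokenizerDemo {
    /** Tokenizes text and returns one descriptive line per token. */
    static List<String> tokenize(String text) {
        List<String> lines = new ArrayList<>();
        // try-with-resources calls close() even if tokenization fails.
        try (KoreanTokenizer tokenizer = new KoreanTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
            ReadingAttribute reading = tokenizer.addAttribute(ReadingAttribute.class);

            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                    // must be called before incrementToken()
            while (tokenizer.incrementToken()) {  // advances to the next token, if any
                lines.add(term + " pos=" + pos.getLeftPOS() + " reading=" + reading.getReading());
            }
            tokenizer.end();                      // finalizes end-of-stream state
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        tokenize("한국어를 공부합니다").forEach(System.out::println);
    }
}
```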
