public final class KoreanTokenizer extends Tokenizer
This tokenizer sets a number of additional attributes:
PartOfSpeechAttribute containing part-of-speech.
ReadingAttribute containing reading.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters.
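To make the idea of a least-cost segmentation concrete, here is a minimal sketch of a Viterbi-style search over a toy cost model. It is not the tokenizer's implementation: the class `ViterbiSketch`, the method `segment`, the `wordCosts` map and the `unknownCharCost` parameter are all hypothetical names chosen for illustration. The sketch assumes each dictionary word has a single integer cost and leaves out both the rolling (windowed) processing of the input and any cost for joining adjacent tokens that a real lattice may use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Lucene API.
class ViterbiSketch {

  /**
   * Returns the segmentation of text with the lowest total cost.
   * Dictionary words cost wordCosts.get(word); any single character
   * missing from the dictionary costs unknownCharCost.
   */
  static List<String> segment(String text, Map<String, Integer> wordCosts, int unknownCharCost) {
    int n = text.length();
    long[] best = new long[n + 1]; // best[i] = least cost of segmenting text[0, i)
    int[] back = new int[n + 1];   // back[i] = start index of the last token on that path
    Arrays.fill(best, Long.MAX_VALUE);
    best[0] = 0;

    // Forward pass: extend every reachable position with every candidate token.
    for (int end = 1; end <= n; end++) {
      for (int start = 0; start < end; start++) {
        if (best[start] == Long.MAX_VALUE) {
          continue; // no path reaches this start position
        }
        Integer cost = wordCosts.get(text.substring(start, end));
        if (cost == null) {
          if (end - start != 1) {
            continue; // unknown candidates are only allowed as single characters
          }
          cost = unknownCharCost;
        }
        long total = best[start] + cost;
        if (total < best[end]) {
          best[end] = total;
          back[end] = start;
        }
      }
    }

    // Backward pass: follow the back pointers from the end of the text.
    List<String> tokens = new ArrayList<>();
    for (int end = n; end > 0; end = back[end]) {
      tokens.add(text.substring(back[end], end));
    }
    Collections.reverse(tokens);
    return tokens;
  }
}
```

With the made-up costs `ab`=3, `abc`=4, `cd`=3, `d`=5 and an unknown-character cost of 10, `ViterbiSketch.segment("abcd", costs, 10)` returns `[ab, cd]` (total cost 6) rather than the greedy longest-match split `abc`, `d` (total cost 9). Choosing the path by total cost instead of by local longest match is the point of the search.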

| Modifier and Type | Class and Description |
|---|---|
| static class | KoreanTokenizer.DecompoundMode: Decompound mode; this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens. |
| static class | KoreanTokenizer.Type: Token type reflecting the original source of this token. |

Nested classes inherited from class AttributeSource: AttributeSource.State

| Modifier and Type | Field and Description |
|---|---|
| static KoreanTokenizer.DecompoundMode | DEFAULT_DECOMPOUND: Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD). |

Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY

| Constructor and Description |
|---|
| KoreanTokenizer(): Creates a new KoreanTokenizer with default parameters. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams): Create a new KoreanTokenizer. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation): Create a new KoreanTokenizer. |

| Modifier and Type | Method and Description |
|---|---|
| void | close() |
| void | end() |
| boolean | incrementToken() |
| void | reset() |
| void | setGraphvizFormatter(GraphvizFormatter dotOut): Expert: set this to produce graphviz (dot) output of the Viterbi lattice. |

Methods inherited from class Tokenizer: correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).

public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters. Uses the default AttributeFactory.

public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Create a new KoreanTokenizer.
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - If true, outputs unigrams for unknown words.

public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer.
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - If true, outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.

public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice.

public void close() throws IOException
Specified by: close in interface Closeable, close in interface AutoCloseable
Overrides: close in class Tokenizer
Throws: IOException

public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException

public void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException

public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.