public final class KoreanTokenizer extends Tokenizer
This tokenizer sets a number of additional attributes:
PartOfSpeechAttribute: contains the part of speech.
ReadingAttribute: contains the reading.
This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
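The idea of a least-cost segmentation can be illustrated with a small, self-contained sketch. This is not the Lucene implementation: the class name, the cost values, the unknown-character fallback, and the romanized stand-in input are all hypothetical. It also fills a full table over the whole input instead of rolling over a bounded window, and it scores each token on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ViterbiSegmentationSketch {
    // Hypothetical cost of emitting one character that no dictionary entry covers.
    static final int UNKNOWN_CHAR_COST = 10;

    /** Returns the segmentation of text with the lowest total token cost. */
    public static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: minimal cost of segmenting text[0, i)
        int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    // Only single characters may fall back to the unknown cost.
                    if (end - start != 1) {
                        continue;
                    }
                    cost = UNKNOWN_CHAR_COST;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the input to recover the path.
        List<String> tokens = new ArrayList<>();
        for (int pos = n; pos > 0; pos = back[pos]) {
            tokens.add(text.substring(back[pos], pos));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    /** Toy dictionary with made-up costs (lower means more likely). */
    public static Map<String, Integer> toyCosts() {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("gagok", 3);
        costs.put("yeok", 2);
        costs.put("ga", 4);
        costs.put("gok", 6);
        return costs;
    }

    public static void main(String[] args) {
        // "gagok" + "yeok" costs 5, while "ga" + "gok" + "yeok" costs 12.
        System.out.println(segment("gagokyeok", toyCosts())); // prints [gagok, yeok]
    }
}
```

Because every position can always be reached through the single-character fallback, the back pointers form a complete path from the end of the input to its start.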
Modifier and Type | Class and Description |
---|---|
static class | KoreanTokenizer.DecompoundMode: Decompound mode. Determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens. |
static class | KoreanTokenizer.Type: Token type reflecting the original source of this token. |
Nested classes inherited from AttributeSource: AttributeSource.State
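The effect of a decompound mode on a compound token can be sketched with a toy model. It assumes the three modes described in the Lucene documentation for KoreanTokenizer.DecompoundMode (NONE keeps the compound as is, DISCARD emits only the parts, MIXED emits the compound and its parts). The class, the enum, and the romanized sample strings below are hypothetical stand-ins and do not call Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DecompoundModeSketch {
    // Stand-in for KoreanTokenizer.DecompoundMode; not the Lucene enum itself.
    enum Mode { NONE, DISCARD, MIXED }

    /** Returns the tokens a compound would contribute under the given mode. */
    public static List<String> emit(String compound, List<String> parts, Mode mode) {
        List<String> out = new ArrayList<>();
        if (mode != Mode.DISCARD) {
            out.add(compound);  // NONE and MIXED keep the original form
        }
        if (mode != Mode.NONE) {
            out.addAll(parts);  // DISCARD and MIXED emit the decomposed parts
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("gagok", "yeok");
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + emit("gagokyeok", parts, mode));
        }
    }
}
```

In this model the choice trades recall against index size: MIXED lets queries match both the whole compound and its parts, at the cost of emitting more tokens.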
Modifier and Type | Field and Description |
---|---|
static KoreanTokenizer.DecompoundMode | DEFAULT_DECOMPOUND: Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD). |
Fields inherited from TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor and Description |
---|
KoreanTokenizer(): Creates a new KoreanTokenizer with default parameters. |
KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams): Creates a new KoreanTokenizer. |
KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation): Creates a new KoreanTokenizer. |
Modifier and Type | Method and Description |
---|---|
void | close() |
void | end() |
boolean | incrementToken() |
void | reset() |
void | setGraphvizFormatter(GraphvizFormatter dotOut): Expert: set this to produce graphviz (dot) output of the Viterbi lattice. |
Methods inherited from Tokenizer: correctOffset, setReader
Methods inherited from AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).
public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters. Uses the default AttributeFactory.
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Creates a new KoreanTokenizer.
Parameters:
factory - the AttributeFactory to use.
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - If true, outputs unigrams for unknown words.
public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Creates a new KoreanTokenizer.
Parameters:
factory - the AttributeFactory to use.
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - If true, outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.
public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice.
public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Overrides: close in class Tokenizer
Throws: IOException
public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException
public void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException
public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.