public final class JapaneseTokenizer extends Tokenizer
This tokenizer sets a number of additional attributes:
- BaseFormAttribute containing base form for inflected adjectives and verbs.
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading and pronunciation.
- InflectionAttribute containing additional part-of-speech information for inflected forms.
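
A minimal sketch of reading these attributes from the token stream, assuming Lucene 4.x with the kuromoji analysis module on the classpath. The class name `JapaneseTokenizerUsage` and the helper `describeTokens` are hypothetical, and the attribute interfaces are assumed to live in `org.apache.lucene.analysis.ja.tokenattributes`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
final class JapaneseTokenizerUsage {

    // Returns one descriptive line per token produced for the given text.
    static List<String> describeTokens(String text) throws IOException {
        // No user dictionary (null), punctuation discarded, SEARCH mode.
        JapaneseTokenizer tokenizer = new JapaneseTokenizer(
                new StringReader(text), null, true, JapaneseTokenizer.Mode.SEARCH);

        // addAttribute returns the existing instance if the tokenizer already registered it.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        BaseFormAttribute baseForm = tokenizer.addAttribute(BaseFormAttribute.class);
        PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
        ReadingAttribute reading = tokenizer.addAttribute(ReadingAttribute.class);
        InflectionAttribute inflection = tokenizer.addAttribute(InflectionAttribute.class);

        List<String> lines = new ArrayList<String>();
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Attribute getters may return null when a value does not apply to the token.
                lines.add(term.toString()
                        + " base=" + baseForm.getBaseForm()
                        + " pos=" + pos.getPartOfSpeech()
                        + " reading=" + reading.getReading()
                        + " pronunciation=" + reading.getPronunciation()
                        + " inflection=" + inflection.getInflectionType());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();
        }
        return lines;
    }
}
```

The reset, incrementToken, end, close sequence follows the usual TokenStream consumer workflow; skipping reset() is a common cause of errors when driving a tokenizer by hand.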
This tokenizer uses a rolling Viterbi search to find the
least-cost segmentation (path) of the incoming characters.
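
To illustrate what a least-cost segmentation means, here is a toy dynamic-programming sketch. It is not the tokenizer's implementation: it fills a full table instead of rolling over the input, it ignores connection costs between adjacent tokens, and the class name `LeastCostSegmenter`, the word-cost map, and the unknown-character cost are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy least-cost segmenter (hypothetical; simplified Viterbi without connection costs).
final class LeastCostSegmenter {

    // Assumed cost for a single character that is not in the dictionary.
    private static final int UNKNOWN_CHAR_COST = 10000;

    static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: minimal cost of segmenting text[0, i)
        int[] back = new int[n + 1];   // back[i]: start index of the last token on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    // Unknown text is only allowed one character at a time.
                    if (end - start != 1) {
                        continue;
                    }
                    cost = UNKNOWN_CHAR_COST;
                }
                long total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back-pointers from the end of the text to recover the path.
        List<String> tokens = new ArrayList<String>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }
}
```

With dictionary entries for "kansai", "kokusai", and "kuukou" at cost 100 each, the text "kansaikokusaikuukou" splits into those three words; adding the whole string as an entry with cost 250 makes the single long token the cheaper path.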
For tokens that appear to be compound (longer than 2 characters
if all Kanji, or longer than 7 characters otherwise), the tokenizer
checks whether a second-best segmentation of that token exists
after applying penalties to the long tokens. If one exists and the
Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation
is output as well.
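
The length test described above can be sketched as follows. This is an illustration under stated assumptions, not the tokenizer's code: the class `CompoundHeuristic` is hypothetical, Kanji is approximated as the Unicode HAN script, and the per-character penalty weights are placeholder values chosen only for the example.

```java
// Hypothetical illustration of the "appears to be compound" length test.
final class CompoundHeuristic {

    static final int KANJI_LENGTH = 2; // from the text: more than 2 characters, all Kanji
    static final int OTHER_LENGTH = 7; // from the text: more than 7 characters, non-Kanji

    // Placeholder weights (assumptions, not taken from the documentation).
    static final int KANJI_PENALTY_PER_CHAR = 3000;
    static final int OTHER_PENALTY_PER_CHAR = 1700;

    // True if every code point of the token belongs to the HAN script.
    static boolean isAllKanji(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // True if the token exceeds the length threshold for its character class.
    static boolean looksCompound(String token) {
        int limit = isAllKanji(token) ? KANJI_LENGTH : OTHER_LENGTH;
        return token.length() > limit;
    }

    // Extra cost added to a long token so that a split alternative can win.
    static int penalty(String token) {
        if (!looksCompound(token)) {
            return 0;
        }
        if (isAllKanji(token)) {
            return (token.length() - KANJI_LENGTH) * KANJI_PENALTY_PER_CHAR;
        }
        return (token.length() - OTHER_LENGTH) * OTHER_PENALTY_PER_CHAR;
    }
}
```

Scaling the penalty with the excess length means longer candidates are pushed harder toward a split, which is one plausible way to make a second-best segmentation surface.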

| Modifier and Type | Class | Description |
|---|---|---|
| static class | JapaneseTokenizer.Mode | Tokenization mode: this determines how the tokenizer handles compound and unknown words. |
| static class | JapaneseTokenizer.Type | Token type reflecting the original source of this token. |

Nested classes inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State

| Modifier and Type | Field | Description |
|---|---|---|
| static JapaneseTokenizer.Mode | DEFAULT_MODE | Default tokenization mode. |

| Constructor | Description |
|---|---|
| JapaneseTokenizer(AttributeSource.AttributeFactory factory, Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) | Create a new JapaneseTokenizer. |
| JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) | Create a new JapaneseTokenizer. |

| Modifier and Type | Method | Description |
|---|---|---|
| void | end() | |
| boolean | incrementToken() | |
| void | reset() | |
| void | setGraphvizFormatter(GraphvizFormatter dotOut) | Expert: set this to produce graphviz (dot) output of the Viterbi lattice. |

Methods inherited from class Tokenizer: close, correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState

public static final JapaneseTokenizer.Mode DEFAULT_MODE
Default tokenization mode: JapaneseTokenizer.Mode.SEARCH.

public JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer. Uses the default AttributeFactory.
Parameters:
- input - Reader containing text
- userDictionary - Optional: if non-null, user dictionary.
- discardPunctuation - true if punctuation tokens should be dropped from the output.
- mode - tokenization mode.

public JapaneseTokenizer(AttributeSource.AttributeFactory factory, Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer.
Parameters:
- factory - the AttributeFactory to use
- input - Reader containing text
- userDictionary - Optional: if non-null, user dictionary.
- discardPunctuation - true if punctuation tokens should be dropped from the output.
- mode - tokenization mode.

public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice.

public void reset() throws IOException
Overrides: reset in class TokenStream
Throws: IOException

public void end()
Overrides: end in class TokenStream

public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.