org.apache.lucene.analysis.ja
Class JapaneseTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.ja.JapaneseTokenizer
- All Implemented Interfaces:
- Closeable
public final class JapaneseTokenizer
- extends Tokenizer
Tokenizer for Japanese that uses morphological analysis.
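This tokenizer sets a number of additional attributes:
- BaseFormAttribute containing base form for inflected adjectives and verbs.
- PartOfSpeechAttribute containing part-of-speech.
- ReadingAttribute containing reading and pronunciation.
- InflectionAttribute containing additional part-of-speech information for inflected forms.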
This tokenizer uses a rolling Viterbi search to find the
least-cost segmentation (path) of the incoming characters.
For tokens that appear to be compound (longer than 2 characters
for all-Kanji tokens, or longer than 7 for non-Kanji tokens), it
checks whether there is a second-best segmentation of that token
after applying penalties to the long tokens. If so, and the Mode is
JapaneseTokenizer.Mode.SEARCH, it outputs the alternate segmentation
as well.
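The least-cost search and the long-token penalty described above can be illustrated with a small standalone sketch. This is not the JapaneseTokenizer implementation: the class LeastCostSegmenter, its segment method, the toy dictionary, the romanized sample text, and the per-character penalty formula are all hypothetical and chosen only for illustration. The sketch also leaves out connection costs between adjacent tokens, unknown-word handling, and the rolling (incremental) processing of input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; it is not part of Lucene.
final class LeastCostSegmenter {
    private final Map<String, Integer> wordCosts;
    private final int maxWordLength;

    LeastCostSegmenter(Map<String, Integer> wordCosts) {
        this.wordCosts = new HashMap<>(wordCosts);
        int max = 1;
        for (String word : wordCosts.keySet()) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /**
     * Finds the least-cost segmentation of text with a Viterbi-style
     * dynamic program. A word longer than lengthThreshold pays
     * penaltyPerChar for every character beyond the threshold (an assumed
     * penalty formula). Returns an empty list if the toy dictionary cannot
     * cover the whole text.
     */
    List<String> segment(String text, int lengthThreshold, int penaltyPerChar) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: cheapest cost to cover text[0, i)
        int[] back = new int[n + 1];   // back[i]: start of the last word on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                if (best[start] == Long.MAX_VALUE) {
                    continue; // position start is not reachable
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    continue; // not a dictionary word
                }
                int length = end - start;
                long penalty = length > lengthThreshold
                        ? (long) (length - lengthThreshold) * penaltyPerChar
                        : 0L;
                long total = best[start] + cost + penalty;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        if (best[n] == Long.MAX_VALUE) {
            return Collections.emptyList();
        }
        // Walk the back pointers from the end, then reverse into reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("kansai", 100);
        costs.put("kokusai", 100);
        costs.put("kuukou", 100);
        costs.put("kansaikokusaikuukou", 250); // compound entry, cheaper than its parts
        LeastCostSegmenter segmenter = new LeastCostSegmenter(costs);

        // No penalty: the single compound token is the least-cost path.
        System.out.println(segmenter.segment("kansaikokusaikuukou", 7, 0));
        // Penalty on tokens longer than 7 characters: the decompounded path wins.
        System.out.println(segmenter.segment("kansaikokusaikuukou", 7, 50));
    }
}
```

With the penalty disabled the compound entry wins; with the penalty enabled the three shorter words become cheaper. That alternate path is the kind of segmentation the description says is also output in JapaneseTokenizer.Mode.SEARCH.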
Nested Class Summary
static class JapaneseTokenizer.Mode
- Tokenization mode: this determines how the tokenizer handles compound and unknown words.
static class JapaneseTokenizer.Type
- Token type reflecting the original source of this token.
Fields inherited from class org.apache.lucene.analysis.Tokenizer
- input
Methods inherited from class org.apache.lucene.util.AttributeSource
- addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
DEFAULT_MODE
public static final JapaneseTokenizer.Mode DEFAULT_MODE
- Default tokenization mode. Currently this is JapaneseTokenizer.Mode.SEARCH.
JapaneseTokenizer
public JapaneseTokenizer(Reader input,
UserDictionary userDictionary,
boolean discardPunctuation,
JapaneseTokenizer.Mode mode)
- Create a new JapaneseTokenizer.
- Parameters:
input - Reader containing text.
userDictionary - Optional: if non-null, user dictionary.
discardPunctuation - true if punctuation tokens should be dropped from the output.
mode - tokenization mode.
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
- Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.
reset
public void reset()
throws IOException
- Overrides: reset in class TokenStream
- Throws: IOException
end
public void end()
- Overrides: end in class TokenStream
incrementToken
public boolean incrementToken()
throws IOException
- Specified by: incrementToken in class TokenStream
- Throws: IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.