Class HMMChineseTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizer for Chinese or mixed Chinese-English text.
This tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
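The two-stage idea described above (split into sentences, then pick the most probable word sequence per sentence) can be illustrated with a toy sketch. This is not Lucene's implementation or model: the class `ToySegmenter`, its tiny dictionary, the probabilities, and the unknown-character penalty are all invented for illustration, and the search is a plain unigram dynamic program rather than the smartcn HMM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: hypothetical dictionary and scores, not the smartcn model.
class ToySegmenter {
    private static final Map<String, Double> LOG_PROB = new HashMap<>();
    // Penalty for a single character that is not in the dictionary (assumed value).
    private static final double UNKNOWN_LOG_PROB = Math.log(1e-8);
    private static int maxWordLength = 1;

    static {
        addWord("研究", 0.010);   // "research"
        addWord("研究生", 0.005); // "graduate student"
        addWord("生命", 0.010);   // "life"
        addWord("命", 0.001);     // "fate"
        addWord("起源", 0.005);   // "origin"
    }

    private static void addWord(String word, double probability) {
        LOG_PROB.put(word, Math.log(probability));
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    // Stage 1: cut the text after each sentence-ending punctuation mark.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if ("。！？.!?".indexOf(text.charAt(i)) >= 0) {
                sentences.add(text.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            sentences.add(text.substring(start));
        }
        return sentences;
    }

    // Stage 2: choose the segmentation with the highest total log-probability.
    static List<String> segmentSentence(String sentence) {
        int n = sentence.length();
        double[] best = new double[n + 1]; // best score of any segmentation ending at i
        int[] prev = new int[n + 1];       // start index of the last word in that segmentation
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                Double logProb = LOG_PROB.get(sentence.substring(start, end));
                if (logProb == null) {
                    if (end - start > 1) {
                        continue; // unknown multi-character strings are not candidate words
                    }
                    logProb = UNKNOWN_LOG_PROB;
                }
                double score = best[start] + logProb;
                if (score > best[end]) {
                    best[end] = score;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the sentence to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(sentence.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            words.addAll(segmentSentence(sentence));
        }
        return words;
    }
}
```

With this dictionary, a greedy longest-match scan of 研究生命起源 would start with 研究生, whereas the probabilistic search prefers 研究 / 生命 / 起源 because that sequence has the higher combined score. Choosing the globally best sequence instead of the locally longest word is the point of the sketch.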
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor    Description
HMMChineseTokenizer()    Creates a new HMMChineseTokenizer
HMMChineseTokenizer(AttributeFactory factory)    Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
Method Summary
Modifier and Type    Method    Description
protected boolean    incrementWord()
void    reset()
protected void    setNextSentence(int sentenceStart, int sentenceEnd)
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
HMMChineseTokenizer
public HMMChineseTokenizer()
Creates a new HMMChineseTokenizer.
-
HMMChineseTokenizer
public HMMChineseTokenizer(AttributeFactory factory)
Creates a new HMMChineseTokenizer, supplying the AttributeFactory.
-
-
Method Details
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
- Specified by:
setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
- Specified by:
incrementWord in class SegmentingTokenizerBase
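The two protected methods above form a callback pair: the base class hands the subclass one sentence at a time through setNextSentence, then calls incrementWord repeatedly until it returns false. The sketch below mirrors that calling pattern with standard-library classes only. `ToySentenceDriver`, `WhitespaceWords`, and `currentWord` are hypothetical names invented here, and splitting words on whitespace is a stand-in for real segmentation, not what HMMChineseTokenizer does.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the base class: it finds sentences and drives the callbacks.
abstract class ToySentenceDriver {
    protected String text;

    // Called once per sentence with its bounds in the text.
    protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

    // Called until it returns false; each true result means one more word is ready.
    protected abstract boolean incrementWord();

    // Helper invented for this sketch so the driver can collect the words.
    protected abstract String currentWord();

    final List<String> tokenize(String input) {
        text = input;
        List<String> words = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(input);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            setNextSentence(start, end);
            while (incrementWord()) {
                words.add(currentWord());
            }
        }
        return words;
    }
}

// Hypothetical subclass: splits each sentence on whitespace.
class WhitespaceWords extends ToySentenceDriver {
    private int pos;
    private int limit;
    private String word;

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        pos = sentenceStart;
        limit = sentenceEnd;
    }

    @Override
    protected boolean incrementWord() {
        // Skip whitespace, then consume one run of non-whitespace characters.
        while (pos < limit && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos >= limit) {
            return false; // sentence exhausted
        }
        int wordStart = pos;
        while (pos < limit && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        word = text.substring(wordStart, pos);
        return true;
    }

    @Override
    protected String currentWord() {
        return word;
    }
}
```

Calling `new WhitespaceWords().tokenize(...)` on a short English passage returns its whitespace-separated words in order. The subclass never sees the whole input at once; it only keeps state for the sentence most recently passed to setNextSentence.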
-
reset
public void reset() throws IOException
- Overrides:
reset in class SegmentingTokenizerBase
- Throws:
IOException
-