org.apache.lucene.analysis.cjk
Class CJKTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.cjk.CJKTokenizer
- All Implemented Interfaces:
- Closeable
public final class CJKTokenizer
- extends org.apache.lucene.analysis.Tokenizer
CJKTokenizer is designed for Chinese, Japanese, and Korean languages.
The tokens returned are overlapping pairs of adjacent characters (bigrams).
Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
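The overlapping segmentation described above can be illustrated with a minimal, self-contained sketch. This is not the Lucene implementation: the class name CjkBigramSketch and its helper methods are hypothetical, and the sketch assumes BMP characters only, classifies CJK characters by Unicode script (Han, Hiragana, Katakana, Hangul), and emits a lone CJK character as a single-character token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of overlapping bigram segmentation; not Lucene code. */
final class CjkBigramSketch {

    /** ASCII letters and digits plus '+', '#', and '_' form Latin tokens. */
    static boolean isLatinTokenChar(char c) {
        return (c < 128 && Character.isLetterOrDigit(c)) || c == '+' || c == '#' || c == '_';
    }

    /** Assumption: a character is CJK if its Unicode script is one of these four. */
    static boolean isCjkChar(char c) {
        Character.UnicodeScript s = Character.UnicodeScript.of(c);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isLatinTokenChar(c)) {
                // A run of Latin token characters becomes one lowercased token.
                int start = i;
                while (i < text.length() && isLatinTokenChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else if (isCjkChar(c)) {
                // A run of CJK characters becomes overlapping two-character tokens.
                int start = i;
                while (i < text.length() && isCjkChar(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(text.substring(start, i)); // lone character (assumption)
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // whitespace, punctuation, and anything else is skipped
            }
        }
        return tokens;
    }
}
```

Under these assumptions, CjkBigramSketch.tokenize("Java 中文分词") yields "java", "中文", "文分", "分词", which mirrors the "java" "C1C2" "C2C3" "C3C4" pattern in the example.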
Additionally, the following is applied to Latin text (such as English):
- Text is converted to lowercase.
- Numeric digits, '+', '#', and '_' are tokenized as letters.
- Full-width forms are converted to half-width forms.
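The Latin normalization steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the tokenizer's actual code: the class LatinNormalizationSketch and its methods are hypothetical, and the sketch only handles the full-width ASCII variants in the range U+FF01 to U+FF5E, which sit at a fixed offset of 0xFEE0 from their half-width counterparts.

```java
/** Hypothetical illustration of the Latin normalization rules; not Lucene code. */
final class LatinNormalizationSketch {

    /** Maps full-width ASCII variants (U+FF01..U+FF5E) to U+0021..U+007E. */
    static char toHalfWidth(char c) {
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        return c; // everything else is left unchanged
    }

    /** Converts each character to half-width, then to lowercase. */
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            sb.append(Character.toLowerCase(toHalfWidth(token.charAt(i))));
        }
        return sb.toString();
    }
}
```

For instance, normalize("\uFF2A\uFF41\uFF56\uFF41") (the full-width spelling of "Java") returns "java", and normalize("C#") returns "c#", since '#' passes through unchanged.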
For more information on Asian language (Chinese, Japanese, and Korean) text segmentation,
search the web.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
CJKTokenizer
public CJKTokenizer(Reader in)
- Construct a token stream processing the given input.
- Parameters:
in
- I/O reader
CJKTokenizer
public CJKTokenizer(org.apache.lucene.util.AttributeSource source,
Reader in)
CJKTokenizer
public CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory,
Reader in)
incrementToken
public boolean incrementToken()
throws IOException
- Advances to the next token in the stream. Returns true if a token is available, or false at end of stream (EOS).
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for details on Unicode blocks.
- Specified by:
incrementToken
in class org.apache.lucene.analysis.TokenStream
- Returns:
- false for end of stream, true otherwise
- Throws:
IOException
- if a read error occurs in the underlying Reader
end
public final void end()
- Overrides:
end
in class org.apache.lucene.analysis.TokenStream
reset
public void reset()
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset(Reader reader)
throws IOException
- Overrides:
reset
in class org.apache.lucene.analysis.Tokenizer
- Throws:
IOException
Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.