org.apache.lucene.analysis.cjk
Class CJKTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.cjk.CJKTokenizer
public final class CJKTokenizer
- extends Tokenizer
CJKTokenizer is designed for Chinese, Japanese, and Korean languages.
The tokens returned are overlapping bigrams: every two adjacent CJK characters form a token.
Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
Additionally, the following is applied to Latin text (such as English):
- Text is converted to lowercase.
- Numeric digits, '+', '#', and '_' are tokenized as letters.
- Full-width forms are converted to half-width forms.
For more information on text segmentation for Asian languages (Chinese, Japanese,
and Korean), consult the literature on CJK bigram (n-gram) tokenization.
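The overlapping-bigram scheme described above can be sketched in plain Java. This is an illustrative approximation, not the actual Lucene implementation: the CJK test is simplified to the Unified Ideographs block, and the full-width-to-half-width folding is omitted.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Segment text the way the class description outlines: Latin runs are
    // kept whole and lowercased (with digits, '+', '#', '_' treated as
    // letters); CJK runs are emitted as overlapping two-character bigrams.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        boolean runIsCjk = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean latin = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '+' || c == '#' || c == '_';
            // Simplification: only the CJK Unified Ideographs block.
            boolean cjk = c >= 0x4E00 && c <= 0x9FFF;
            if (!latin && !cjk) {            // separator: close the current run
                flush(tokens, run, runIsCjk);
                continue;
            }
            if (run.length() > 0 && cjk != runIsCjk) {
                flush(tokens, run, runIsCjk); // script changed: close the run
            }
            runIsCjk = cjk;
            run.append(Character.toLowerCase(c));
        }
        flush(tokens, run, runIsCjk);
        return tokens;
    }

    private static void flush(List<String> tokens, StringBuilder run, boolean cjk) {
        if (run.length() == 0) return;
        if (!cjk || run.length() == 1) {
            tokens.add(run.toString());       // Latin token, or a lone CJK char
        } else {
            for (int i = 0; i + 1 < run.length(); i++) {
                tokens.add(run.substring(i, i + 2)); // overlapping bigrams
            }
        }
        run.setLength(0);
    }
}
```

For instance, `segment("java 中文分词")` yields the tokens "java", "中文", "文分", "分词", matching the segmentation pattern shown above.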
Fields inherited from class org.apache.lucene.analysis.Tokenizer |
input |
Method Summary |
void |
end()
This method is called by the consumer after the last token has been
consumed, after TokenStream.incrementToken() returned false
(using the new TokenStream API). |
boolean |
incrementToken()
Returns true for the next token in the stream, or false at EOS. |
void |
reset()
Resets this stream to the beginning. |
void |
reset(Reader reader)
Expert: Reset the tokenizer to a new reader. |
Methods inherited from class org.apache.lucene.util.AttributeSource |
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString |
CJKTokenizer
public CJKTokenizer(Reader in)
- Construct a token stream processing the given input.
- Parameters:
in
- the input Reader
CJKTokenizer
public CJKTokenizer(AttributeSource source,
Reader in)
CJKTokenizer
public CJKTokenizer(AttributeSource.AttributeFactory factory,
Reader in)
incrementToken
public boolean incrementToken()
throws IOException
- Returns true for the next token in the stream, or false at EOS.
See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html
for detail.
- Overrides:
incrementToken
in class TokenStream
- Returns:
- false for end of stream, true otherwise
- Throws:
IOException
- if a read error occurs in the input stream
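The incrementToken()/end() contract described here can be illustrated with a minimal stand-in class; this is a simplified sketch of the consumer protocol, not Lucene's TokenStream itself (which works through reusable attributes rather than returning strings).

```java
import java.util.Iterator;
import java.util.List;

class MiniTokenStream {
    private final Iterator<String> tokens;
    private String term;
    private int endOffset;   // running end offset of the last token returned
    private int finalOffset; // set once by end()

    MiniTokenStream(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    // Contract: advance to the next token and return true,
    // or return false at end of stream (EOS).
    boolean incrementToken() {
        if (!tokens.hasNext()) return false;
        term = tokens.next();
        endOffset += term.length();
        return true;
    }

    // Called by the consumer exactly once, after incrementToken()
    // has returned false, to perform end-of-stream operations.
    void end() {
        finalOffset = endOffset;
    }

    String term() { return term; }
    int getFinalOffset() { return finalOffset; }
}
```

A consumer drives the stream with `while (ts.incrementToken()) { ... }` and then calls `ts.end()` once, mirroring the new TokenStream API described in this page.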
end
public final void end()
- Description copied from class:
TokenStream
- This method is called by the consumer after the last token has been
consumed, after
TokenStream.incrementToken()
returned false
(using the new TokenStream
API). Streams implementing the old API
should upgrade to use this feature.
This method can be used to perform any end-of-stream operations, such as
setting the final offset of a stream. The final offset of a stream might
differ from the offset of the last token, e.g., when one or more whitespace
characters follow the last token and a WhitespaceTokenizer
was used.
- Overrides:
end
in class TokenStream
reset
public void reset()
throws IOException
- Description copied from class:
TokenStream
- Resets this stream to the beginning. This is an optional operation, so
subclasses may or may not implement this method.
TokenStream.reset()
is not needed for
the standard indexing process. However, if the tokens of a
TokenStream
are intended to be consumed more than once, it is
necessary to implement TokenStream.reset()
. Note that if your TokenStream
caches tokens and feeds them back again after a reset, it is imperative
that you clone the tokens when you store them away (on the first pass) as
well as when you return them (on future passes after TokenStream.reset()
).
- Overrides:
reset
in class TokenStream
- Throws:
IOException
reset
public void reset(Reader reader)
throws IOException
- Description copied from class:
Tokenizer
- Expert: Reset the tokenizer to a new reader. Typically, an
analyzer (in its reusableTokenStream method) will use
this to re-use a previously created tokenizer.
- Overrides:
reset
in class Tokenizer
- Throws:
IOException
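The reuse pattern that reset(Reader) enables can be sketched with a hypothetical stand-in tokenizer; the names below are illustrative, not Lucene API, and the "tokenization" is reduced to reading single characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReusableCharTokenizer {
    private Reader input;
    private int position; // characters consumed from the current reader

    ReusableCharTokenizer(Reader in) {
        this.input = in;
    }

    // Expert: point this tokenizer at a new reader and rewind internal
    // state, so one instance can process many documents in sequence
    // (as an analyzer's reusableTokenStream method would do).
    void reset(Reader reader) {
        this.input = reader;
        this.position = 0;
    }

    // Toy tokenization: return the next character, or -1 at end of input.
    int next() {
        try {
            int c = input.read();
            if (c != -1) position++;
            return c;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    int position() { return position; }
}
```

After `reset(new StringReader("..."))`, the instance behaves as if freshly constructed on the new input, which is what lets analyzers avoid allocating a new tokenizer per document.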
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.