org.apache.lucene.analysis.cjk
Class CJKTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.cjk.CJKTokenizer

public final class CJKTokenizer
extends org.apache.lucene.analysis.Tokenizer

CJKTokenizer is designed for Chinese, Japanese, and Korean languages.

The tokens returned are overlapping bigrams: each token consists of two adjacent characters, and consecutive tokens share one character.

Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
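
The overlapping-bigram rule in the example above can be sketched in plain Java without Lucene. This is an illustration only, not the CJKTokenizer source: the class name BigramSketch and the method segment are hypothetical, and the sketch assumes that any non-ASCII letter counts as a CJK character, that a run of ASCII letters or digits forms one token, and that a lone CJK character is emitted as a single-character token. Supplementary (surrogate-pair) characters are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping-bigram segmentation; not Lucene code.
final class BigramSketch {

    // Splits text into tokens: ASCII runs stay whole, non-ASCII runs become bigrams.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                i++; // whitespace and punctuation separate tokens
                continue;
            }
            // Assumption: ASCII means "Latin", everything else means "CJK".
            boolean ascii = c < 0x80;
            int start = i;
            while (i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i))
                    && (text.charAt(i) < 0x80) == ascii) {
                i++;
            }
            String run = text.substring(start, i);
            if (ascii || run.length() == 1) {
                tokens.add(run); // whole Latin word, or a lone CJK character
            } else {
                // Emit every pair of adjacent characters: C1C2, C2C3, C3C4, ...
                for (int j = 0; j + 1 < run.length(); j++) {
                    tokens.add(run.substring(j, j + 2));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u4e00\u4e8c\u4e09\u56db" is four CJK ideographs standing in for C1C2C3C4.
        System.out.println(segment("java \u4e00\u4e8c\u4e09\u56db"));
    }
}
```

A run of n CJK characters yields n - 1 bigrams, so the index grows roughly linearly with the text while still matching any two-character query without a dictionary.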

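Additionally, the following is applied to Latin text (such as English): text is converted to lowercase; numeric digits, '+', '#', and '_' are tokenized as letters; and full-width forms are converted to half-width forms.

For more information on Asian-language (Chinese, Japanese, and Korean) text segmentation, search Google.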


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory, Reader in)
           
CJKTokenizer(org.apache.lucene.util.AttributeSource source, Reader in)
           
CJKTokenizer(Reader in)
          Constructs a token stream that processes the given input.
 
Method Summary
 void end()
           
 boolean incrementToken()
          Advances to the next token; returns true if one is available, or false at end of stream (EOS).
 void reset()
           
 void reset(Reader reader)
           
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, next, next, setOnlyUseNewAPI
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

CJKTokenizer

public CJKTokenizer(Reader in)
Constructs a token stream that processes the given input.

Parameters:
in - the Reader that supplies the input text

CJKTokenizer

public CJKTokenizer(org.apache.lucene.util.AttributeSource source,
                    Reader in)

CJKTokenizer

public CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory,
                    Reader in)

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException
Advances to the next token; returns true if one is available, or false at end of stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Unicode blocks.

Overrides:
incrementToken in class org.apache.lucene.analysis.TokenStream
Returns:
false for end of stream, true otherwise
Throws:
IOException - if a read error occurs on the underlying Reader
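
The link to Character.UnicodeBlock above suggests that characters are classified by Unicode block. As a rough sketch of how such a check can look with the standard library, the snippet below uses Character.UnicodeBlock.of(char). The class name BlockSketch and both helper methods are hypothetical, and the particular set of blocks treated as CJK is an assumption made for illustration; this page does not state which blocks the tokenizer actually tests.

```java
// Hypothetical illustration of Unicode-block classification; not Lucene code.
final class BlockSketch {

    // Assumption: these four blocks stand in for "CJK" in this sketch.
    static boolean isCjk(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // True for characters in the Basic Latin block (U+0000 to U+007F).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    public static void main(String[] args) {
        System.out.println(isBasicLatin('a'));  // Latin letter
        System.out.println(isCjk('\u4e00'));    // CJK unified ideograph
        System.out.println(isCjk('\u3042'));    // Hiragana
        System.out.println(isCjk('\ud55c'));    // Hangul syllable
    }
}
```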

end

public final void end()
Overrides:
end in class org.apache.lucene.analysis.TokenStream

reset

public void reset()
           throws IOException
Overrides:
reset in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

reset

public void reset(Reader reader)
           throws IOException
Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
IOException


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.