CJKTokenizer (Lucene 3.1.0 API)

org.apache.lucene.analysis.cjk
Class CJKTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.cjk.CJKTokenizer

All Implemented Interfaces:: Closeable

public final class CJKTokenizer
extends org.apache.lucene.analysis.Tokenizer

CJKTokenizer is designed for Chinese, Japanese, and Korean languages.

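The tokenizer emits overlapping bigrams: each token consists of two adjacent CJK characters, and each token shares one character with the next.

Example: "java C1C2C3C4", where C1 through C4 stand for CJK characters, is segmented into: "java" "C1C2" "C2C3" "C3C4".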
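The overlapping-bigram rule can be illustrated with a small standalone sketch. This is not the Lucene implementation: the class `BigramSketch`, its `segment` method, and the script-based `isCjk` test are hypothetical names and simplifications introduced here, and emitting a lone CJK character as a single-character token is a choice of this sketch rather than behavior stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping bigram segmentation.
// It does not use or reproduce the Lucene CJKTokenizer code.
class BigramSketch {

    // Assumption: a character counts as CJK if its Unicode script is
    // Han, Hiragana, Katakana, or Hangul.
    static boolean isCjk(char c) {
        Character.UnicodeScript s = Character.UnicodeScript.of(c);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // ASCII letters and digits form ordinary word tokens in this sketch.
    static boolean isLatin(char c) {
        return c < 128 && Character.isLetterOrDigit(c);
    }

    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isLatin(c)) {
                // Consume a whole run of Latin characters as one token.
                int start = i;
                while (i < text.length() && isLatin(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isCjk(c)) {
                // Consume a run of CJK characters, then emit every adjacent pair.
                int start = i;
                while (i < text.length() && isCjk(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Sketch-only choice: a lone CJK character becomes its own token.
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // Skip whitespace, punctuation, and anything else.
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The escapes spell four Han characters (U+65E5 U+672C U+8A9E U+5B66),
        // playing the role of C1C2C3C4 in the example above.
        System.out.println(segment("java \u65E5\u672C\u8A9E\u5B66"));
    }
}
```

For the input in `main`, the sketch yields four tokens: "java" followed by three bigrams, each sharing one character with its neighbor, which mirrors the "C1C2" "C2C3" "C3C4" pattern.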

Additionally, the following is applied to Latin text (such as English):

Text is converted to lowercase.
Numeric digits, '+', '#', and '_' are tokenized as letters.
Full-width forms are converted to half-width forms.
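The three Latin-text rules above can be sketched in plain Java as follows. The class `LatinNormalizationSketch` and its methods are hypothetical names used only for illustration, and the code is a simplified approximation rather than the tokenizer's actual logic. It relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their half-width counterparts U+0021 to U+007E.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the Latin-text rules; not the Lucene code.
class LatinNormalizationSketch {

    // Map full-width ASCII variants (U+FF01..U+FF5E) to half-width ASCII.
    static char foldFullWidth(char c) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
            return (char) (c - 0xFEE0);
        }
        return c;
    }

    // Digits, '+', '#', and '_' are treated like letters.
    static boolean isTokenChar(char c) {
        return (c < 128 && Character.isLetterOrDigit(c))
            || c == '+' || c == '#' || c == '_';
    }

    // Split text into lowercase Latin tokens after width folding.
    static List<String> latinTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = foldFullWidth(text.charAt(i));
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [c++, and, c#, use, snake_case]
        System.out.println(latinTokens("C++ and C# use snake_case"));
        // The escapes spell full-width "JAVA"; prints [java]
        System.out.println(latinTokens("\uFF2A\uFF21\uFF36\uFF21"));
    }
}
```

Because '+', '#', and '_' count as letters, names such as "c++" and "c#" survive as single tokens instead of collapsing to "c".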

For more information on Asian language (Chinese, Japanese, and Korean) text segmentation, please search Google.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State`

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Constructor Summary
`CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory, Reader in)`
`CJKTokenizer(org.apache.lucene.util.AttributeSource source, Reader in)`
`CJKTokenizer(Reader in)` Construct a token stream processing the given input.

Method Summary
`void`	`end()`
`boolean`	`incrementToken()` Returns true for the next token in the stream, or false at EOS.
`void`	`reset()`
`void`	`reset(Reader reader)`

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Constructor Detail

CJKTokenizer

public CJKTokenizer(Reader in)

Construct a token stream processing the given input.

Parameters:: in - the Reader that supplies the input text

CJKTokenizer

public CJKTokenizer(org.apache.lucene.util.AttributeSource source,
                    Reader in)

CJKTokenizer

public CJKTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory,
                    Reader in)

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Advances to the next token in the stream. Returns true if a token is available, or false at end of stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Unicode blocks.

Specified by:: incrementToken in class org.apache.lucene.analysis.TokenStream

Returns:: false at end of stream, true otherwise
Throws:: IOException - if a read error occurs on the underlying input

end

public final void end()

Overrides:: end in class org.apache.lucene.analysis.TokenStream

reset

public void reset()
           throws IOException

Overrides:: reset in class org.apache.lucene.analysis.TokenStream

Throws:: IOException

reset

public void reset(Reader reader)
           throws IOException

Overrides:: reset in class org.apache.lucene.analysis.Tokenizer

Throws:: IOException
