CJKTokenizer (Lucene 3.4.0 API)

org.apache.lucene.analysis.cjk
Class CJKTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.cjk.CJKTokenizer

All Implemented Interfaces: Closeable

public final class CJKTokenizer
extends Tokenizer

CJKTokenizer is designed for Chinese, Japanese, and Korean languages.

The tokens returned are every two adjacent characters with overlap match.

Example: "java C1C2C3C4" will be segmented to: "java" "C1C2" "C2C3" "C3C4".
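The overlapping two-character segmentation described above can be sketched in plain Java. This is a minimal illustration, not the Lucene implementation: the class name `BigramSketch`, the script-based `isCjk` test, the restriction to BMP characters, and the choice to emit an isolated CJK character as a one-character token are all assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Assumption of this sketch: a "CJK character" is a BMP char whose
    // Unicode script is Han, Hiragana, Katakana, or Hangul.
    static boolean isCjk(char c) {
        Character.UnicodeScript s = Character.UnicodeScript.of(c);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                // Collect the whole run of CJK characters.
                int start = i;
                while (i < text.length() && isCjk(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // Assumption: a lone CJK character becomes its own token.
                    tokens.add(text.substring(start, i));
                } else {
                    // Emit every pair of adjacent characters, overlapping by one.
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else if (Character.isLetterOrDigit(c)) {
                // A run of non-CJK letters and digits is kept as a single token.
                int start = i;
                while (i < text.length()
                        && !isCjk(text.charAt(i))
                        && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else {
                i++; // Skip separators such as whitespace and punctuation.
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "java" followed by four Han characters, written as Unicode escapes.
        System.out.println(tokenize("java \u65E5\u672C\u8A9E\u5B66"));
    }
}
```

With four CJK characters C1C2C3C4 after the word "java", the sketch yields "java", "C1C2", "C2C3", "C3C4", matching the example above.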

Additionally, the following is applied to Latin text (such as English):

Text is converted to lowercase.
Numeric digits, '+', '#', and '_' are tokenized as letters.
Full-width forms are converted to half-width forms.
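The three rules for Latin text can be illustrated with a small stand-alone sketch. It assumes Latin-only input and is not the Lucene source: the class `LatinNormalizationSketch` and its helper names are hypothetical, and the full-width mapping assumes the standard layout in which U+FF01 through U+FF5E mirror ASCII U+0021 through U+007E at a fixed distance of 0xFEE0.

```java
import java.util.ArrayList;
import java.util.List;

class LatinNormalizationSketch {
    // Map a full-width form (U+FF01..U+FF5E) to its half-width ASCII
    // counterpart; every other character is returned unchanged.
    static char toHalfWidth(char c) {
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        return c;
    }

    // Digits, '+', '#', and '_' count as token characters alongside letters.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '+' || c == '#' || c == '_';
    }

    // Assumption: the input contains Latin text only; CJK handling is omitted.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = toHalfWidth(text.charAt(i));
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++ and C# use snake_case"));
        // Full-width "JAVA" written as Unicode escapes, followed by a digit.
        System.out.println(tokenize("\uFF2A\uFF21\uFF36\uFF21 7"));
    }
}
```

Under these assumptions, "C++" and "C#" survive as the tokens "c++" and "c#" instead of collapsing to "c", and full-width "JAVA" becomes "java".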

For more information on Asian language (Chinese, Japanese, and Korean) text segmentation, search the web for the topic.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Constructor Summary
`CJKTokenizer(AttributeSource.AttributeFactory factory, Reader in)`
`CJKTokenizer(AttributeSource source, Reader in)`
`CJKTokenizer(Reader in)` Construct a token stream processing the given input.

Method Summary
`void`	`end()` This method is called by the consumer after the last token has been consumed, after `TokenStream.incrementToken()` returned `false` (using the new `TokenStream` API).
`boolean`	`incrementToken()` Returns true for the next token in the stream, or false at EOS.
`void`	`reset()` Resets this stream to the beginning.
`void`	`reset(Reader reader)` Expert: Reset the tokenizer to a new reader.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Constructor Detail

CJKTokenizer

public CJKTokenizer(Reader in)

Construct a token stream processing the given input.

Parameters: in - I/O reader

CJKTokenizer

public CJKTokenizer(AttributeSource source,
                    Reader in)

CJKTokenizer

public CJKTokenizer(AttributeSource.AttributeFactory factory,
                    Reader in)

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Returns true for the next token in the stream, or false at EOS. See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for detail.

Specified by: incrementToken in class TokenStream

Returns: false for end of stream, true otherwise
Throws: IOException - if a read error occurs in the underlying Reader

end

public final void end()

Description copied from class: TokenStream

This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g. when one or more whitespace characters follow the last token and a WhitespaceTokenizer was used.
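The difference between the end offset of the last token and the final offset of the stream can be shown with a small stand-alone sketch. It does not use the Lucene classes: `FinalOffsetSketch` and its `offsets` method are hypothetical, and the sketch assumes a whitespace-splitting tokenizer whose final offset is simply the length of the input.

```java
class FinalOffsetSketch {
    // Returns { end offset of the last token, final offset of the stream }
    // for a whitespace-separated input.
    static int[] offsets(String text) {
        int lastTokenEnd = 0;
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++; // Whitespace between or after tokens is skipped.
                continue;
            }
            // Consume one token and remember where it ended.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            lastTokenEnd = i;
        }
        // Assumption: the final offset covers all consumed input,
        // including any whitespace after the last token.
        int finalOffset = text.length();
        return new int[] { lastTokenEnd, finalOffset };
    }

    public static void main(String[] args) {
        int[] r = offsets("hello world  ");
        System.out.println("last token ends at " + r[0] + ", final offset is " + r[1]);
    }
}
```

For the input "hello world" followed by two spaces, the last token ends at offset 11 while the final offset is 13; an end-of-stream hook such as end() is where a stream would record the latter.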

Overrides: end in class TokenStream

reset

public void reset()
           throws IOException

Description copied from class: TokenStream

Resets this stream to the beginning. This is an optional operation, so subclasses may or may not implement this method. TokenStream.reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement TokenStream.reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after TokenStream.reset()).

Overrides: reset in class TokenStream

Throws: IOException

reset

public void reset(Reader reader)
           throws IOException

Description copied from class: Tokenizer

Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

Overrides: reset in class Tokenizer

Throws: IOException
