ICUTokenizer (Lucene 4.2.1 API)

org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.icu.segmentation.ICUTokenizer

All Implemented Interfaces: Closeable

public final class ICUTokenizer
extends Tokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Text is first split at script boundaries. Each single-script run is then segmented into words by the BreakIterator, and the resulting tokens are typed, as provided by the ICUTokenizerConfig.

See Also: ICUTokenizerConfig
WARNING: This API is experimental and might change in incompatible ways in the next release.
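
The word-boundary idea behind this class can be tried without Lucene on the classpath. The sketch below is an illustration only: it uses the JDK's `java.text.BreakIterator` as a stand-in, not the ICU4J iterator or the script-boundary handling that ICUTokenizer relies on, so its results may differ from ICUTokenizer's for many inputs. The class name `WordBreakSketch`, the helper `words`, and the rule of keeping only segments that contain a letter or digit are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of Lucene.
public class WordBreakSketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    static List<String> words(String text) {
        // JDK word iterator, used here as a stand-in for an ICU BreakIterator.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Assumption for this sketch: drop segments that are only
            // whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```

Every position reported by the iterator is a boundary, so punctuation and whitespace come back as segments of their own; a tokenizer has to decide which segments become tokens.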

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Constructor Summary
`ICUTokenizer(Reader input)` Construct a new ICUTokenizer that breaks text into words from the given Reader.
`ICUTokenizer(Reader input, ICUTokenizerConfig config)` Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

Method Summary
`void`	`end()`
`boolean`	`incrementToken()`
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset, setReader`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

ICUTokenizer

public ICUTokenizer(Reader input)

Construct a new ICUTokenizer that breaks text into words from the given Reader.

The default script-specific handling is used.

Parameters: input - Reader containing text to tokenize.
See Also: DefaultICUTokenizerConfig

ICUTokenizer

public ICUTokenizer(Reader input,
                    ICUTokenizerConfig config)

Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Specified by: incrementToken in class TokenStream

Throws: IOException

reset

public void reset()
           throws IOException

Overrides: reset in class TokenStream

Throws: IOException

end

public void end()

Overrides: end in class TokenStream
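
The three methods above are normally called in the order the TokenStream contract expects: `reset()`, then `incrementToken()` until it returns false, then `end()`, and finally `close()`. The following is a minimal sketch of that consumer loop, assuming the Lucene 4.2.1 ICU analysis module and its ICU4J dependency are on the classpath. The class name `ICUTokenizerWorkflow`, the helper `tokenize`, and the `term[start,end)` output format are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Hypothetical illustration class, not part of Lucene.
public class ICUTokenizerWorkflow {

    /** Tokenizes text and returns each term with its character offsets. */
    static List<String> tokenize(String text) throws IOException {
        // Single-argument constructor: default script-specific handling.
        ICUTokenizer tokenizer = new ICUTokenizer(new StringReader(text));
        // Attributes are registered once and updated in place per token.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
        List<String> out = new ArrayList<String>();
        try {
            tokenizer.reset();                   // must precede incrementToken()
            while (tokenizer.incrementToken()) { // false once input is exhausted
                out.add(term.toString()
                        + "[" + offset.startOffset() + "," + offset.endOffset() + ")");
            }
            tokenizer.end();                     // end-of-stream bookkeeping
        } finally {
            tokenizer.close();                   // releases the underlying Reader
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String token : tokenize("Hello world")) {
            System.out.println(token);
        }
    }
}
```

The term and offset values must be copied out inside the loop, because the same attribute instances are reused for every token.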
