org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
All Implemented Interfaces:
Closeable

public final class ICUTokenizer
extends Tokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Text is first split at script boundaries. Each resulting run is then segmented into words by the BreakIterator, and each token is assigned a type, as provided by the ICUTokenizerConfig.
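
The two-stage behaviour described above can be sketched with JDK classes only. This is an illustration under stated assumptions, not the ICUTokenizer implementation: it uses java.text.BreakIterator and Character.UnicodeScript rather than ICU, attaches COMMON and INHERITED code points to the run in progress, and keeps only segments that contain a letter or digit. The class and method names (ScriptThenWordSketch, scriptRuns, words, tokenize) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ScriptThenWordSketch {

    /** Stage 1: split text into single-script runs. COMMON and INHERITED
     *  code points (spaces, punctuation, digits, combining marks) stay in the current run. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        Character.UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // script changed: close the run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    /** Stage 2: word-break one run and drop segments without a letter or digit. */
    static List<String> words(String run) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(run);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = run.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    /** Both stages combined. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            tokens.addAll(words(run));
        }
        return tokens;
    }
}
```

In this sketch a Latin word written directly against a Cyrillic word, with no space between them, comes out as two tokens because the script change closes the first run before word breaking starts. The real tokenizer's handling of neutral characters and its break rules may differ in detail.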

See Also:
ICUTokenizerConfig
WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
ICUTokenizer(Reader input)
          Construct a new ICUTokenizer that breaks text into words from the given Reader.
ICUTokenizer(Reader input, ICUTokenizerConfig config)
          Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
 
Method Summary
 void end()
           
 boolean incrementToken()
           
 void reset()
           
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ICUTokenizer

public ICUTokenizer(Reader input)
Construct a new ICUTokenizer that breaks text into words from the given Reader.

The default script-specific handling is used.

Parameters:
input - Reader containing text to tokenize.
See Also:
DefaultICUTokenizerConfig

ICUTokenizer

public ICUTokenizer(Reader input,
                    ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.
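
To make the idea of a tailored configuration concrete, the sketch below models a config as an object that answers two questions per script: which BreakIterator to use, and how to type a token. It is an assumption-laden stand-in built on java.text.BreakIterator. The SegmentationConfig interface, its method names, and the type labels are hypothetical and are not the ICUTokenizerConfig API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TailoredConfigSketch {

    /** Hypothetical stand-in for the role a tokenizer config plays; not the Lucene class. */
    interface SegmentationConfig {
        BreakIterator breakIteratorFor(Character.UnicodeScript script);
        String typeFor(Character.UnicodeScript script, String token);
    }

    /** Example tailoring: Han runs are split per character, all other runs per word. */
    static final SegmentationConfig HAN_PER_CHARACTER = new SegmentationConfig() {
        @Override
        public BreakIterator breakIteratorFor(Character.UnicodeScript script) {
            return script == Character.UnicodeScript.HAN
                    ? BreakIterator.getCharacterInstance(Locale.ROOT)
                    : BreakIterator.getWordInstance(Locale.ROOT);
        }

        @Override
        public String typeFor(Character.UnicodeScript script, String token) {
            if (script == Character.UnicodeScript.HAN) {
                return "ideographic";
            }
            return token.codePoints().allMatch(Character::isDigit) ? "number" : "word";
        }
    };

    /** Segments one single-script run and labels each kept token as "text/type". */
    static List<String> segment(SegmentationConfig config,
                                Character.UnicodeScript script, String run) {
        List<String> out = new ArrayList<>();
        BreakIterator it = config.breakIteratorFor(script);
        it.setText(run);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = run.substring(start, end);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(token + "/" + config.typeFor(script, token));
            }
        }
        return out;
    }
}
```

Keeping the per-script choices behind one object means the segmentation loop itself never changes. Swapping the config is enough to change both where breaks fall and how tokens are labelled, which is the role the config parameter plays in this constructor.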
Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException
Specified by:
incrementToken in class TokenStream
Throws:
IOException

reset

public void reset()
           throws IOException
Overrides:
reset in class TokenStream
Throws:
IOException

end

public void end()
Overrides:
end in class TokenStream
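
The three methods above are normally called in a fixed order: reset() once, incrementToken() until it returns false, then end(), and finally close() (inherited from Tokenizer). The sketch below shows that call order with a JDK-only stand-in so that it runs without Lucene on the classpath. WordStream and its accessors (term, startOffset, endOffset, finalOffset) are hypothetical. With the real class, token text and offsets would be read from attributes such as CharTermAttribute instead, and input would not necessarily be read in full during reset().

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LifecycleSketch {

    /** Hypothetical stand-in that follows the same call order; not a Lucene class. */
    static final class WordStream implements Closeable {
        private final Reader input;
        private final BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
        private String text = "";
        private int start;
        private int end;
        private int finalOffset;

        WordStream(Reader input) {
            this.input = input;
        }

        /** Reads the input and positions the stream before the first token. */
        void reset() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                sb.append(buf, 0, n);
            }
            text = sb.toString();
            breaker.setText(text);
            start = end = breaker.first();
        }

        /** Advances to the next token; returns false once the input is exhausted. */
        boolean incrementToken() {
            while (true) {
                start = end;
                int next = breaker.next();
                if (next == BreakIterator.DONE) {
                    return false;
                }
                end = next;
                // Skip segments that are only whitespace or punctuation.
                if (text.substring(start, end).codePoints().anyMatch(Character::isLetterOrDigit)) {
                    return true;
                }
            }
        }

        /** Records end-of-stream state, here the final offset. */
        void end() {
            finalOffset = text.length();
        }

        String term() { return text.substring(start, end); }
        int startOffset() { return start; }
        int endOffset() { return end; }
        int finalOffset() { return finalOffset; }

        @Override
        public void close() throws IOException {
            input.close();
        }
    }

    /** Consumer loop: reset, incrementToken until false, end, then close. */
    static List<String> consume(String s) {
        List<String> out = new ArrayList<>();
        try (WordStream stream = new WordStream(new StringReader(s))) {
            stream.reset();
            while (stream.incrementToken()) {
                out.add(stream.term() + "[" + stream.startOffset() + "," + stream.endOffset() + ")");
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling consume("Hello, world") on this stand-in yields Hello[0,5) and world[7,12). The try-with-resources block guarantees close() runs even if reset() or incrementToken() fails, which matters for the real class because both are declared to throw IOException.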


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.