org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
              extended by org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
All Implemented Interfaces:
Closeable

public final class ICUTokenizer
extends org.apache.lucene.analysis.Tokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Text is first split at script boundaries. Each single-script run is then segmented by the BreakIterator, and its tokens are typed, as provided by the ICUTokenizerConfig.
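
The two-stage idea described above (split at script boundaries, then segment each run) can be sketched with JDK classes only. This is an illustrative approximation, not the ICUTokenizer implementation: it assumes java.text.BreakIterator and Character.UnicodeScript as stand-ins for the ICU machinery, and the class name ScriptRunWordSketch and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ScriptRunWordSketch {

    /**
     * Splits text into runs that each contain a single script.
     * COMMON and INHERITED characters (spaces, punctuation, combining
     * marks) stay attached to the run that is currently open.
     */
    public static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        Character.UnicodeScript current = null;
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    // Script changed: close the run that ends before this code point.
                    runs.add(text.substring(start, i));
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    /** Segments every script run with the JDK word BreakIterator (an assumption, not ICU). */
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String run : scriptRuns(text)) {
            BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
            it.setText(run);
            int s = it.first();
            for (int e = it.next(); e != BreakIterator.DONE; s = e, e = it.next()) {
                String piece = run.substring(s, e);
                // Keep only pieces that contain a letter or digit; drop spaces and punctuation.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    out.add(piece);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Latin text directly followed by Cyrillic text, written with escapes.
        System.out.println(scriptRuns("abc\u043c\u0438\u0440"));
        System.out.println(words("hello \u043c\u0438\u0440"));
    }
}
```

Splitting by script first means a run of mixed-script text never reaches a single segmenter, so each script can, in principle, be given its own rules.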

See Also:
ICUTokenizerConfig

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
ICUTokenizer(Reader input)
          Construct a new ICUTokenizer that breaks text into words from the given Reader.
ICUTokenizer(Reader input, ICUTokenizerConfig config)
          Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
 
Method Summary
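 void end()
          Called by the consumer after the last token has been consumed, that is, after incrementToken() has returned false.
 boolean incrementToken()
          Advances the stream to the next token; returns false when the end of the stream is reached.
 void reset()
          Resets this stream to the beginning.
 void reset(Reader input)
          Expert: resets the tokenizer to read from a new Reader.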
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ICUTokenizer

public ICUTokenizer(Reader input)
Construct a new ICUTokenizer that breaks text into words from the given Reader.

The default script-specific handling is used.

Parameters:
input - Reader containing text to tokenize.
See Also:
DefaultICUTokenizerConfig

ICUTokenizer

public ICUTokenizer(Reader input,
                    ICUTokenizerConfig config)
Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration
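
As a rough illustration of what a tailored configuration means, the sketch below chooses a different segmenter per script. It assumes JDK classes only: SketchConfig, WORDS, HAN_UNIGRAMS, and the type labels are hypothetical names invented for this example and do not reproduce the real ICUTokenizerConfig API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TailoredConfigSketch {

    /** Hypothetical stand-in for a tokenizer config: a segmenter and a type label per script. */
    public interface SketchConfig {
        BreakIterator breakIteratorFor(Character.UnicodeScript script);
        String typeFor(Character.UnicodeScript script);
    }

    /** Untailored behavior: word boundaries for every script. */
    public static final SketchConfig WORDS = new SketchConfig() {
        public BreakIterator breakIteratorFor(Character.UnicodeScript script) {
            return BreakIterator.getWordInstance(Locale.ROOT);
        }
        public String typeFor(Character.UnicodeScript script) {
            return "WORD";
        }
    };

    /** Tailored behavior: Han text is emitted one character at a time. */
    public static final SketchConfig HAN_UNIGRAMS = new SketchConfig() {
        public BreakIterator breakIteratorFor(Character.UnicodeScript script) {
            return script == Character.UnicodeScript.HAN
                    ? BreakIterator.getCharacterInstance(Locale.ROOT)
                    : BreakIterator.getWordInstance(Locale.ROOT);
        }
        public String typeFor(Character.UnicodeScript script) {
            return script == Character.UnicodeScript.HAN ? "HAN_CHAR" : "WORD";
        }
    };

    /** Tokenizes a run that the caller has already restricted to one script. */
    public static List<String> tokenize(String run,
                                        Character.UnicodeScript script,
                                        SketchConfig config) {
        List<String> out = new ArrayList<>();
        BreakIterator it = config.breakIteratorFor(script);
        it.setText(run);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = run.substring(start, end);
            // Skip pieces made only of spaces or punctuation.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece + "/" + config.typeFor(script));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two Han characters, written with escapes.
        System.out.println(tokenize("\u4e2d\u6587", Character.UnicodeScript.HAN, HAN_UNIGRAMS));
        System.out.println(tokenize("hello world", Character.UnicodeScript.LATIN, HAN_UNIGRAMS));
    }
}
```

Keeping the segmenter choice and the type label behind one small interface lets the tokenizing loop stay identical for every script, which is the design the two constructors above suggest.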

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException
Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

reset

public void reset()
           throws IOException
Overrides:
reset in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

reset

public void reset(Reader input)
           throws IOException
Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
IOException

end

public void end()
         throws IOException
Overrides:
end in class org.apache.lucene.analysis.TokenStream
Throws:
IOException
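
The methods above are normally called in a fixed order by the consumer: reset(), then incrementToken() until it returns false, then end(). The sketch below shows that order on a hypothetical stand-in class, SketchTokenizer, which borrows the method names from this page but is not a Lucene class. It assumes the JDK word BreakIterator for segmentation, and it wraps I/O failures in UncheckedIOException to stay short, whereas the real methods declare IOException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ConsumeLoopSketch {

    /** Hypothetical stand-in that mirrors the method names on this page. */
    public static final class SketchTokenizer {
        private final BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        private Reader input;          // pending Reader, or null once it has been read
        private String text = "";
        private int start;
        private int end;
        public String term = "";       // text of the current token
        public int finalOffset = -1;   // set by end()

        public SketchTokenizer(Reader input) {
            this.input = input;
        }

        /** Swaps in a new Reader so the same instance can be reused. */
        public void reset(Reader input) {
            this.input = input;
            reset();
        }

        /** Loads any pending Reader, then rewinds to the first boundary. */
        public void reset() {
            if (input != null) {
                try {
                    StringBuilder sb = new StringBuilder();
                    char[] buf = new char[1024];
                    for (int n; (n = input.read(buf)) != -1; ) {
                        sb.append(buf, 0, n);
                    }
                    text = sb.toString();
                    input = null;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            words.setText(text);
            end = words.first();
            finalOffset = -1;
        }

        /** Moves to the next piece containing a letter or digit; false at end of input. */
        public boolean incrementToken() {
            while (true) {
                start = end;
                end = words.next();
                if (end == BreakIterator.DONE) {
                    end = start;
                    return false;
                }
                String piece = text.substring(start, end);
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    term = piece;
                    return true;
                }
            }
        }

        /** Records the offset just past the last character of the input. */
        public void end() {
            finalOffset = text.length();
        }
    }

    /** Runs the full consumer sequence: reset, loop, end. */
    public static List<String> collect(SketchTokenizer tok) {
        List<String> terms = new ArrayList<>();
        tok.reset();
        while (tok.incrementToken()) {
            terms.add(tok.term);
        }
        tok.end();
        return terms;
    }

    public static void main(String[] args) {
        SketchTokenizer tok = new SketchTokenizer(new StringReader("one two"));
        System.out.println(collect(tok));
        tok.reset(new StringReader("three"));   // reuse the same instance
        System.out.println(collect(tok));
    }
}
```

Calling end() only after incrementToken() has returned false is what lets a consumer read a final offset that covers trailing characters which produced no token.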


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.