public final class ICUTokenizer
extends org.apache.lucene.analysis.Tokenizer
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the ICUTokenizerConfig.
See Also: ICUTokenizerConfig
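The two-stage idea described above (split at script boundaries first, then segment each run with a BreakIterator) can be illustrated with JDK classes only. This is a rough sketch under stated assumptions, not the ICU implementation: it swaps in `java.text.BreakIterator` and `Character.UnicodeScript` for the ICU classes, and the class and method names (`ScriptSegmentSketch`, `splitByScript`, `words`, `tokenize`) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; uses JDK stand-ins, not ICU or Lucene classes.
class ScriptSegmentSketch {

    /** Splits text into maximal runs of a single Unicode script.
     *  COMMON and INHERITED characters (spaces, punctuation, combining marks)
     *  are treated as neutral and stay in the current run. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        Character.UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            boolean neutral = s == Character.UnicodeScript.COMMON
                    || s == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && s != current) {
                    runs.add(text.substring(start, i)); // script changed: close the run
                    start = i;
                }
                current = s;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }

    /** Segments one run with the JDK word BreakIterator, keeping only
     *  pieces that contain at least one letter or digit. */
    static List<String> words(String run) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(run);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = run.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    /** Stage 1: split by script. Stage 2: word-segment each run. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : splitByScript(text)) {
            tokens.addAll(words(run));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "hello" (Latin) immediately followed by three Cyrillic letters, no space between.
        System.out.println(tokenize("hello\u043c\u0438\u0440"));
    }
}
```

In this sketch the script split is what separates the Latin and Cyrillic halves of an unspaced string; the BreakIterator alone is only asked to segment text of one script at a time.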
Constructor and Description |
---|
ICUTokenizer(Reader input) Construct a new ICUTokenizer that breaks text into words from the given Reader. |
ICUTokenizer(Reader input, ICUTokenizerConfig config) Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration. |
Modifier and Type | Method and Description |
---|---|
void | end() |
boolean | incrementToken() |
void | reset() |
void | reset(Reader input) |
Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
public ICUTokenizer(Reader input)
The default script-specific handling is used.
Parameters: input - Reader containing text to tokenize.
See Also: DefaultICUTokenizerConfig
public ICUTokenizer(Reader input, ICUTokenizerConfig config)
Parameters: input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.
public boolean incrementToken() throws IOException
Specified by: incrementToken in class org.apache.lucene.analysis.TokenStream
Throws: IOException
public void reset() throws IOException
Overrides: reset in class org.apache.lucene.analysis.TokenStream
Throws: IOException
public void reset(Reader input) throws IOException
Overrides: reset in class org.apache.lucene.analysis.Tokenizer
Throws: IOException
public void end() throws IOException
Overrides: end in class org.apache.lucene.analysis.TokenStream
Throws: IOException
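The methods listed above are normally called in a fixed order by the consumer: loop on incrementToken() until it returns false, then call end(), and call reset(Reader) to reuse the same instance on new input. The sketch below shows that call pattern with a stand-in class so that it runs on the JDK alone. `MiniTokenizer`, `term()`, and `consume` are hypothetical names, and the whitespace splitting is a placeholder that has nothing to do with ICU segmentation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the method names above; NOT Lucene's ICUTokenizer.
class MiniTokenizer {
    private Reader input;
    private String term; // current token text, valid after incrementToken() returns true

    MiniTokenizer(Reader input) {
        this.input = input;
    }

    /** Advances to the next whitespace-delimited token; returns false when input is exhausted. */
    boolean incrementToken() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) {
                    break; // token complete
                }
                // otherwise skip leading whitespace
            } else {
                sb.append((char) c);
            }
        }
        if (sb.length() == 0) {
            return false;
        }
        term = sb.toString();
        return true;
    }

    /** Called once after the last token; a real tokenizer would do end-of-stream bookkeeping here. */
    void end() {
        term = null;
    }

    /** Points this instance at new input so it can be reused. */
    void reset(Reader input) {
        this.input = input;
        this.term = null;
    }

    String term() {
        return term;
    }

    /** Consumer loop: incrementToken() until false, then end(). */
    static List<String> consume(MiniTokenizer t) {
        List<String> out = new ArrayList<>();
        try {
            while (t.incrementToken()) {
                out.add(t.term());
            }
            t.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        MiniTokenizer t = new MiniTokenizer(new StringReader("one two"));
        System.out.println(consume(t));
        t.reset(new StringReader("three")); // reuse the same instance
        System.out.println(consume(t));
    }
}
```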