public final class ICUTokenizer extends Tokenizer
Words are first split at script boundaries, then segmented according to
the BreakIterator and token typing provided by the ICUTokenizerConfig.
See Also: ICUTokenizerConfig
Nested classes inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State

| Constructor and Description |
|---|
| ICUTokenizer(Reader input) Construct a new ICUTokenizer that breaks text into words from the given Reader. |
| ICUTokenizer(Reader input, ICUTokenizerConfig config) Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration. |

| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| boolean | incrementToken() |
| void | reset() |
| void | setReader(Reader input) |

Methods inherited from class Tokenizer: close, correctOffset
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState

public ICUTokenizer(Reader input)
The default script-specific handling is used.
Parameters:
input - Reader containing text to tokenize.
See Also: DefaultICUTokenizerConfig

public ICUTokenizer(Reader input, ICUTokenizerConfig config)
Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.

public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

public void reset() throws IOException
Overrides: reset in class TokenStream
Throws: IOException

public void setReader(Reader input) throws IOException
Overrides: setReader in class Tokenizer
Throws: IOException

public void end()
Overrides: end in class TokenStream

Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.