public final class ICUTokenizer extends Tokenizer
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the ICUTokenizerConfig.
See Also: ICUTokenizerConfig
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State
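The segmentation step described above can be illustrated with the JDK's own `java.text.BreakIterator`. This is only a rough sketch under stated assumptions: it uses the JDK word iterator rather than ICU's, it does not split on script boundaries, and it assigns no token types. The class name `WordBreakSketch` and the helper `words` are hypothetical and are not part of Lucene or ICU.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /**
     * Splits text at the word boundaries reported by the JDK BreakIterator
     * and keeps only the pieces that contain a letter or digit.
     * Assumption: this stands in for the ICU BreakIterator that
     * ICUTokenizerConfig would supply; it is not the Lucene implementation.
     */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation runs; skip those.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42"));
    }
}
```

The real tokenizer additionally restricts each run of text to a single script before applying the iterator chosen by the configuration, which this sketch leaves out.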
Constructor and Description |
---|
ICUTokenizer(Reader input): Construct a new ICUTokenizer that breaks text into words from the given Reader. |
ICUTokenizer(Reader input, ICUTokenizerConfig config): Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration. |
Modifier and Type | Method and Description |
---|---|
void | end() |
boolean | incrementToken() |
void | reset() |
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
public ICUTokenizer(Reader input)
The default script-specific handling is used.
Parameters:
input - Reader containing text to tokenize.
See Also: DefaultICUTokenizerConfig
public ICUTokenizer(Reader input, ICUTokenizerConfig config)
Parameters:
input - Reader containing text to tokenize.
config - Tailored BreakIterator configuration.
public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset() throws IOException
Overrides: reset in class TokenStream
Throws: IOException
public void end()
Overrides: end in class TokenStream
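A consumer of a TokenStream normally calls these three methods in a fixed order: reset() once, incrementToken() until it returns false, then end(), and finally close(). The sketch below mimics that call order using only the JDK so that it runs without Lucene on the classpath. `MiniTokenizer`, `term()`, `finalOffset()` and `tokenize` are hypothetical names invented for this illustration; they are not Lucene classes or methods, and the word breaking uses `java.text.BreakIterator` as a stand-in for the ICU iterator.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenLoopSketch {

    /** Hypothetical stand-in, NOT a Lucene class: mirrors the reset/incrementToken/end call order. */
    static final class MiniTokenizer {
        private final Reader input;
        private final BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
        private String text = "";
        private String term;
        private int end;
        private int finalOffset;
        private boolean ready;

        MiniTokenizer(Reader input) { this.input = input; }

        /** Reads the input and rewinds the iterator; must precede incrementToken(). */
        void reset() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                sb.append(buf, 0, n);
            }
            text = sb.toString();
            breaker.setText(text);
            end = breaker.first();
            ready = true;
        }

        /** Advances to the next word; returns false once the input is exhausted. */
        boolean incrementToken() {
            if (!ready) throw new IllegalStateException("call reset() first");
            while (true) {
                int start = end;
                int next = breaker.next();
                if (next == BreakIterator.DONE) return false;
                end = next;
                String piece = text.substring(start, end);
                // Skip whitespace and punctuation runs between words.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    term = piece;
                    return true;
                }
            }
        }

        /** Records end-of-stream state, here just the final offset. */
        void end() { finalOffset = text.length(); }

        void close() throws IOException { input.close(); }

        String term() { return term; }

        int finalOffset() { return finalOffset; }
    }

    /** Drives the tokenizer through the full consumer workflow. */
    static List<String> tokenize(String s) {
        MiniTokenizer tok = new MiniTokenizer(new StringReader(s));
        List<String> out = new ArrayList<>();
        try {
            try {
                tok.reset();
                while (tok.incrementToken()) {
                    out.add(tok.term());
                }
                tok.end();
            } finally {
                tok.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("reset, then loop"));
    }
}
```

With the real ICUTokenizer the loop has the same shape, but token text and offsets are read through attributes registered on the stream rather than through accessor methods like the ones assumed here.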
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.