public final class SimplePatternSplitTokenizer extends Tokenizer
This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton to locate tokens.
The regexp syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. This is just
like SimplePatternTokenizer except that the pattern should match valid token separator characters, like
String.split. Empty string tokens are never produced.

Nested classes inherited from AttributeSource: AttributeSource.State

Fields inherited: DEFAULT_TOKEN_ATTRIBUTE_FACTORY

| Constructor and Description |
|---|
| SimplePatternSplitTokenizer(AttributeFactory factory, Automaton dfa) Runs a pre-built automaton. |
| SimplePatternSplitTokenizer(AttributeFactory factory, String regexp, int maxDeterminizedStates) See RegExp for the accepted syntax. |
| SimplePatternSplitTokenizer(Automaton dfa) Runs a pre-built automaton. |
| SimplePatternSplitTokenizer(String regexp) See RegExp for the accepted syntax. |
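
The String.split comparison in the class description can be illustrated with a JDK-only sketch. It does not use Lucene at all: the helper `splitDroppingEmpty` is a hypothetical name, and it relies on `java.util.regex`, whose syntax differs from Lucene's RegExp. It only shows the described contract, namely that the pattern matches separators and that empty tokens are never emitted, and is not how the tokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSemanticsSketch {

    // Hypothetical helper: treats every match of separatorRegex as a token
    // separator and returns the non-empty text between separators.
    static List<String> splitDroppingEmpty(String separatorRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorRegex).matcher(text);
        int start = 0; // start of the pending token
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore zero-width matches, they separate nothing
            }
            if (m.start() > start) {
                tokens.add(text.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start < text.length()) {
            tokens.add(text.substring(start)); // trailing token, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        // String.split keeps the empty string between the two commas.
        System.out.println(Arrays.toString("a,,b,".split(","))); // prints [a, , b]
        // The split-tokenizer contract drops it.
        System.out.println(splitDroppingEmpty(",", "a,,b,"));    // prints [a, b]
    }
}
```
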
| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| boolean | incrementToken() |
| void | reset() |
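
The methods in the table above are normally called in the standard TokenStream consumer order: `setReader`, `reset`, repeated `incrementToken`, `end`, then `close`. The following is a minimal sketch of that workflow, assuming the Lucene core and analyzers-common jars are on the classpath and that the class lives in the `org.apache.lucene.analysis.pattern` package. The class name `SplitTokenizerDemo`, the helper `tokenize`, and the separator pattern `[ ,]+` are illustrative choices, not part of the documented API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class SplitTokenizerDemo {

    // Hypothetical helper: collects the terms produced for `text` when
    // `separatorRegexp` (Lucene RegExp syntax) matches the separators.
    static List<String> tokenize(String separatorRegexp, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // Tokenizer is Closeable, so try-with-resources calls close() for us.
        try (Tokenizer tokenizer = new SimplePatternSplitTokenizer(separatorRegexp)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                  // must be called before incrementToken()
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());     // copy: the attribute is reused per token
            }
            tokenizer.end();                    // end-of-stream bookkeeping (final offset)
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Runs of spaces and commas act as separators.
        System.out.println(tokenize("[ ,]+", "a, b,,c"));
    }
}
```

Because empty tokens are never produced, the doubled comma in the sample input should not yield an empty term between `b` and `c`.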

Methods inherited from Tokenizer: close, correctOffset, setReader

Methods inherited from AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

public SimplePatternSplitTokenizer(String regexp)
See RegExp for the accepted syntax.

public SimplePatternSplitTokenizer(Automaton dfa)
Runs a pre-built automaton.

public SimplePatternSplitTokenizer(AttributeFactory factory, String regexp, int maxDeterminizedStates)
See RegExp for the accepted syntax.

public SimplePatternSplitTokenizer(AttributeFactory factory, Automaton dfa)
Runs a pre-built automaton.

public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

public void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException

public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException

Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.