Class SimplePatternTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.pattern.SimplePatternTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton to locate tokens. The regexp syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. The provided regex should match valid token characters (not token separator characters, like String.split). The matching is greedy: the longest match at a given start point will be the next token. Empty string tokens are never produced.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
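The distinction between matching token characters and splitting on separators can be illustrated with a minimal sketch. It uses only java.util.regex as a stand-in and does not call Lucene, so it is not this class's implementation; the class and method names (TokenCharsVsSplitSketch, matchTokens, splitOnSeparators) are hypothetical. For a simple character-class pattern such as [a-z]+, java.util.regex's greedy quantifier yields the same longest match described above, but its syntax and alternation semantics differ from Lucene's RegExp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for Lucene's RegExp here.
final class TokenCharsVsSplitSketch {

    // The pattern describes the characters that make up a token.
    // Each non-empty match becomes one token; empty matches are skipped,
    // mirroring "empty string tokens are never produced".
    static List<String> matchTokens(String tokenRegex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            if (m.end() > m.start()) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    // The pattern describes the separators between tokens, as String.split expects.
    static List<String> splitOnSeparators(String separatorRegex, String text) {
        return Arrays.asList(text.split(separatorRegex));
    }

    public static void main(String[] args) {
        String text = ", foo bar";
        // Token-character pattern: only the runs of letters are reported.
        System.out.println(matchTokens("[a-z]+", text));
        // Separator pattern: split also reports an empty leading string.
        System.out.println(splitOnSeparators("[^a-z]+", text));
    }
}
```

With the input ", foo bar", the token-character approach returns only foo and bar, while the separator approach additionally returns an empty leading string. That is the practical difference a caller should keep in mind when writing the regexp for this tokenizer.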
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor: Description
SimplePatternTokenizer(String regexp): See RegExp for the accepted syntax.
SimplePatternTokenizer(AttributeFactory factory, String regexp, int determinizeWorkLimit): See RegExp for the accepted syntax.
SimplePatternTokenizer(AttributeFactory factory, Automaton dfa): Runs a pre-built automaton.
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
Method Details
-
incrementToken
- Specified by: incrementToken in class TokenStream
- Throws: IOException
-
end
- Overrides: end in class TokenStream
- Throws: IOException
-
reset
- Overrides: reset in class Tokenizer
- Throws: IOException
-