Class PatternTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.pattern.PatternTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It
takes two arguments: "pattern" and "group".
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens are equivalent
to the output of String.split(java.lang.String), with empty tokens omitted.
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
NOTE: This Tokenizer does not output tokens that are of zero length.
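The group semantics described above can be illustrated with a minimal sketch that uses only java.util.regex. It is not the Lucene implementation: the class PatternTokenSketch and its tokenize helper are hypothetical names introduced here, and the split branch only approximates String.split by collecting the non-empty text between matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
class PatternTokenSketch {

    // Mimics the documented "group" argument using plain java.util.regex.
    static List<String> tokenize(Pattern pattern, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        if (group >= 0) {
            // group >= 0: each match contributes the selected capture group.
            while (m.find()) {
                String t = m.group(group);
                if (t != null && !t.isEmpty()) {
                    tokens.add(t); // zero-length tokens are never emitted
                }
            }
        } else {
            // group < 0: "split" mode, the text between matches becomes tokens.
            int start = 0;
            while (m.find()) {
                if (m.start() > start) {
                    tokens.add(input.substring(start, m.start()));
                }
                start = m.end();
            }
            if (start < input.length()) {
                tokens.add(input.substring(start));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(tokenize(quoted, 0, input)); // ['bbb', 'ccc']
        System.out.println(tokenize(quoted, 1, input)); // [bbb, ccc]
        System.out.println(tokenize(Pattern.compile("\\s+"), -1, "aaa bbb  ccc")); // [aaa, bbb, ccc]
    }
}
```

With the real class, the same pattern and group values would be passed to a PatternTokenizer constructor and the tokens read through the usual TokenStream workflow (reset, incrementToken, end, close).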
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor and Description
PatternTokenizer(Pattern pattern, int group)
Creates a new PatternTokenizer returning tokens from group (-1 for split functionality).
PatternTokenizer(AttributeFactory factory, Pattern pattern, int group)
Creates a new PatternTokenizer returning tokens from group (-1 for split functionality).
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.Tokenizer
correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
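PatternTokenizer
public PatternTokenizer(Pattern pattern, int group)
Creates a new PatternTokenizer returning tokens from group (-1 for split functionality).
-
PatternTokenizer
public PatternTokenizer(AttributeFactory factory, Pattern pattern, int group)
Creates a new PatternTokenizer returning tokens from group (-1 for split functionality).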
-
-
Method Details
-
incrementToken
public boolean incrementToken()
- Specified by:
incrementToken in class TokenStream
-
end
public void end() throws IOException
- Overrides:
end in class TokenStream
- Throws:
IOException
-
close
public void close() throws IOException
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class Tokenizer
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset in class Tokenizer
- Throws:
IOException
-