Class SimplePatternTokenizerFactory
- java.lang.Object
  - org.apache.lucene.analysis.util.AbstractAnalysisFactory
    - org.apache.lucene.analysis.util.TokenizerFactory
      - org.apache.lucene.analysis.pattern.SimplePatternTokenizerFactory
public class SimplePatternTokenizerFactory extends TokenizerFactory
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "maxDeterminizedStates" (optional, default 10000) is the limit on the total state count for the determinized automaton computed from the regexp
The pattern matches the characters to include in a token (not the split characters), and the matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>
</fieldType>
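The "pattern matches the token, not the delimiter" behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses java.util.regex rather than Lucene's RegExp and automaton machinery, and the class name GreedyTokenSketch and its tokenize helper are hypothetical, not part of Lucene. For a simple character-class pattern such as the whitespace example, the two engines should produce the same tokens. For patterns with alternation they may differ, because a backtracking engine returns the first alternative that matches rather than the longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for Lucene's RegExp engine.
// GreedyTokenSketch and tokenize are hypothetical names, not Lucene API.
final class GreedyTokenSketch {

    // Returns every non-empty match of `regex` in `input`, scanning left to right.
    // The regex describes the characters that belong to a token, not the separators.
    static List<String> tokenize(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // Skip zero-length matches, mirroring "empty tokens are never created".
            if (m.end() > m.start()) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same character class as the configuration example; in a Java string
        // literal the escapes become real tab, CR and LF characters.
        System.out.println(tokenize("[^ \t\r\n]+", "fast  regexp\ttokenizer\n"));
    }
}
```

Runs of consecutive whitespace produce no tokens here because the pattern never matches them, which is also why no empty strings appear between the two spaces in the sample input.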
- Since:
- 6.5.0
- See Also:
SimplePatternTokenizer
- WARNING: This API is experimental and might change in incompatible ways in the next release.
- SPI Name (Note: This is case-insensitive. e.g., if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service):
- "simplePattern"
Field Summary
Fields
Modifier and Type    Field      Description
static String        NAME       SPI name
static String        PATTERN
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
Constructor                                              Description
SimplePatternTokenizerFactory(Map<String,String> args)   Creates a new SimplePatternTokenizerFactory
Method Summary
Modifier and Type         Method                             Description
SimplePatternTokenizer    create(AttributeFactory factory)   Creates a TokenStream of the specified input using the given AttributeFactory
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
NAME
public static final String NAME
SPI name
- See Also:
- Constant Field Values
PATTERN
public static final String PATTERN
- See Also:
- Constant Field Values
Method Detail
create
public SimplePatternTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
- create in class TokenizerFactory