Class SimplePatternTokenizerFactory
- java.lang.Object
  - org.apache.lucene.analysis.AbstractAnalysisFactory
    - org.apache.lucene.analysis.TokenizerFactory
      - org.apache.lucene.analysis.pattern.SimplePatternTokenizerFactory
public class SimplePatternTokenizerFactory extends TokenizerFactory
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "determinizeWorkLimit" (optional, default 10000) is the limit on total effort spent to determinize the automaton computed from the regexp
The pattern matches the characters to include in a token (not the split characters), and the matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>
</fieldType>
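The matching rules described above (the pattern selects the characters of a token rather than the separators, the longest match at each position is taken, and empty matches are dropped) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses java.util.regex as a stand-in for Lucene's RegExp automaton, and the class TokenPatternSketch and its tokenize helper are hypothetical names that are not part of Lucene. For a simple character class such as the whitespace pattern above the two engines should agree, but java.util.regex is a backtracking engine with different syntax and alternation semantics, so the sketch does not reproduce SimplePatternTokenizer in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene.
class TokenPatternSketch {

    // Collects every non-empty match of the regex as one token.
    // Assumption: java.util.regex stands in for Lucene's RegExp here.
    static List<String> tokenize(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // Skip zero-length matches, mirroring "empty tokens are never created".
            if (matcher.end() > matcher.start()) {
                tokens.add(matcher.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java compiler turns \t, \r and \n into the actual control
        // characters, so the character class excludes real whitespace.
        System.out.println(tokenize("[^ \t\r\n]+", "  fast  pattern\ttokenizer\n"));
    }
}
```

Running main prints [fast, pattern, tokenizer]: each token is a maximal run of non-whitespace characters, and the whitespace runs between them produce no tokens at all.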
- Since: 6.5.0
- See Also: SimplePatternTokenizer
- WARNING: This API is experimental and might change in incompatible ways in the next release.
- SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service): "simplePattern"
Field Summary
Fields
Modifier and Type    Field      Description
static String        NAME       SPI name
static String        PATTERN
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
Constructor                                               Description
SimplePatternTokenizerFactory()                           Default ctor for compatibility with SPI
SimplePatternTokenizerFactory(Map<String,String> args)    Creates a new SimplePatternTokenizerFactory
Method Summary
Modifier and Type         Method                              Description
SimplePatternTokenizer    create(AttributeFactory factory)
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
NAME
public static final String NAME
SPI name
- See Also: Constant Field Values
PATTERN
public static final String PATTERN
- See Also: Constant Field Values
Method Detail
create
public SimplePatternTokenizer create(AttributeFactory factory)
- Specified by: create in class TokenizerFactory