Class SimplePatternSplitTokenizerFactory
- java.lang.Object
- org.apache.lucene.analysis.util.AbstractAnalysisFactory
- org.apache.lucene.analysis.util.TokenizerFactory
- org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory
public class SimplePatternSplitTokenizerFactory extends TokenizerFactory
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "maxDeterminizedStates" (optional, default 10000) is the limit on total state count for the determinized automaton computed from the regexp
The pattern matches the characters that should split tokens, like String.split, and the matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
  </analyzer>
</fieldType>
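The splitting behavior described above (separators are matched greedily, and empty tokens are never emitted) can be illustrated with a minimal sketch. Note the assumptions: this uses java.util.regex rather than Lucene's RegExp automaton, so it only approximates the semantics for simple patterns such as the whitespace class in the example, and the class name `SplitSketch` and method `splitTokens` are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: emulates "split on separator, never emit empty tokens"
// with java.util.regex. It does NOT use Lucene's RegExp syntax or automaton.
class SplitSketch {

    // Hypothetical helper: returns the non-empty text runs between separator matches.
    static List<String> splitTokens(String input, String separatorPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorPattern).matcher(input);
        int start = 0; // start offset of the pending token
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore zero-length separator matches
            }
            if (m.start() > start) {
                // Only emit a token when there is text before the separator.
                tokens.add(input.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start < input.length()) {
            tokens.add(input.substring(start)); // trailing token, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same separator class as the Solr example: runs of space, tab, CR, LF.
        System.out.println(splitTokens("  hello \t big\r\n world ", "[ \\t\\r\\n]+"));
    }
}
```

Leading, trailing, and consecutive separators produce no tokens here, which mirrors the "empty tokens are never created" rule. For patterns with alternation, java.util.regex may pick a different match than a longest-match automaton would, so treat this as a sketch of the idea rather than a drop-in equivalent.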
- Since:
- 6.5.0
- See Also:
SimplePatternSplitTokenizer
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Field Summary
Fields
Modifier and Type    Field      Description
static String        PATTERN
-
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors
Constructor    Description
SimplePatternSplitTokenizerFactory(Map<String,String> args)    Creates a new SimplePatternSplitTokenizerFactory
-
Method Summary
Modifier and Type    Method    Description
SimplePatternSplitTokenizer    create(AttributeFactory factory)    Creates a TokenStream of the specified input using the given AttributeFactory
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
-
PATTERN
public static final String PATTERN
- See Also:
- Constant Field Values
-
-
Method Detail
-
create
public SimplePatternSplitTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
create in class TokenizerFactory