public class SimplePatternSplitTokenizerFactory extends TokenizerFactory
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
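This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "maxDeterminizedStates" (optional, default 10000) is the limit on the total state count for the determinized automaton computed from the regexp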
The pattern matches the characters that should split tokens, like String.split, and the matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
    <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
      </analyzer>
    </fieldType>
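The splitting behavior described above can be approximated with the JDK alone. The following is a minimal sketch, not Lucene's implementation: it uses java.util.regex in place of Lucene RegExp (the two syntaxes differ, and java.util.regex does not guarantee longest-match semantics for alternations), and the class name SplitSketch and the method splitTokens are hypothetical. It only illustrates that a run of separator characters is consumed as one separator and that empty tokens are never emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSketch {

    // Hypothetical helper: splits the input wherever the separator pattern matches
    // and skips empty tokens (String.split, by contrast, can return a leading empty string).
    public static List<String> splitTokens(String input, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorRegex).matcher(input);
        int start = 0; // start of the token currently being collected
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore zero-length separator matches
            }
            if (m.start() > start) {
                tokens.add(input.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start < input.length()) {
            tokens.add(input.substring(start)); // trailing token after the last separator
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Same whitespace pattern as in the fieldType example.
        System.out.println(splitTokens("  quick\tbrown \r\n fox ", "[ \t\r\n]+"));
        // prints [quick, brown, fox]
    }
}
```

Leading, trailing, and repeated separators produce no tokens here, which mirrors the "empty tokens are never created" rule stated for the tokenizer.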
See Also: SimplePatternSplitTokenizer
Modifier and Type | Field and Description |
---|---|
static String | PATTERN |
Fields inherited from class AbstractAnalysisFactory: LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor and Description |
---|
SimplePatternSplitTokenizerFactory(Map<String,String> args) Creates a new SimplePatternSplitTokenizerFactory |
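The constructor receives its configuration as a Map<String,String>. The sketch below uses only the JDK and is not Lucene code: the class ArgsSketch is hypothetical, its two helpers merely imitate the names require and getInt from the inherited-method list, and the optional argument name "maxDeterminizedStates" with a default of 10000 is an assumption. It shows one plausible way such a map could be consumed: required keys are removed and validated, optional keys fall back to a default, and leftover keys are rejected.

```java
import java.util.HashMap;
import java.util.Map;

public class ArgsSketch {

    // Hypothetical stand-in for a required-argument lookup:
    // removes and returns the value, or fails if the key is absent.
    public static String require(Map<String, String> args, String name) {
        String value = args.remove(name);
        if (value == null) {
            throw new IllegalArgumentException("missing required parameter '" + name + "'");
        }
        return value;
    }

    // Hypothetical stand-in for an optional integer argument with a default value.
    public static int getInt(Map<String, String> args, String name, int defaultValue) {
        String value = args.remove(name);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] unused) {
        Map<String, String> args = new HashMap<>(); // mutable, because consumed keys are removed
        args.put("pattern", "[ \t\r\n]+");

        String pattern = require(args, "pattern");
        int maxStates = getInt(args, "maxDeterminizedStates", 10000); // assumed name and default

        if (!args.isEmpty()) {
            // Anything left over was not recognized.
            throw new IllegalArgumentException("unknown parameters: " + args);
        }
        System.out.println("pattern length=" + pattern.length() + ", maxStates=" + maxStates);
    }
}
```

Under these assumptions, passing a mutable map such as a HashMap is the safer choice, since a factory written this way removes entries as it reads them.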
Modifier and Type | Method and Description |
---|---|
SimplePatternSplitTokenizer | create(AttributeFactory factory) Creates a TokenStream of the specified input using the given AttributeFactory |
Methods inherited from class TokenizerFactory: availableTokenizers, create, forName, lookupClass, reloadTokenizers
Methods inherited from class AbstractAnalysisFactory: get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
public static final String PATTERN
public SimplePatternSplitTokenizer create(AttributeFactory factory)
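Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
Specified by: create in class TokenizerFactory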
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.