public class SimplePatternTokenizerFactory extends TokenizerFactory
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.
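This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens for the input stream. The syntax is more limited than PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:
"pattern" (required): the regular expression, using the syntax described at RegExp.
"maxDeterminizedStates" (optional, default 10000): the limit on the total state count of the determinized automaton computed from the regexp.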
The pattern matches the characters to include in a token (not the split characters), and the matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
  </analyzer>
</fieldType>
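The matching rule described above (the pattern names the characters that belong to a token, the longest run is taken, and empty matches produce no token) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses `java.util.regex` as a stand-in rather than Lucene's RegExp and automaton machinery, whose syntax and matching semantics differ for more complex patterns, and the class `TokenPatternSketch` and its `tokens` helper are hypothetical names that are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Lucene or Solr.
class TokenPatternSketch {

    // Collects every non-empty match of tokenPattern as one token.
    // Assumption: java.util.regex stands in for Lucene's RegExp here.
    static List<String> tokens(String tokenPattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(tokenPattern).matcher(input);
        while (m.find()) {
            // Skip zero-length matches, mirroring "empty tokens are never created".
            if (m.end() > m.start()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The pattern describes token characters, not the separators between them.
        System.out.println(tokens("[^ \t\r\n]+", "quick  brown\tfox\n")); // [quick, brown, fox]
        // Greedy matching: each full run of digits becomes a single token.
        System.out.println(tokens("[0-9]+", "build 2019 rev 42"));        // [2019, 42]
    }
}
```

Note that consecutive separators do not yield empty tokens, because only the matched runs are collected.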
See Also: SimplePatternTokenizer
Modifier and Type | Field and Description |
---|---|
static String | PATTERN |
Fields inherited from class AbstractAnalysisFactory: LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor and Description |
---|
SimplePatternTokenizerFactory(Map<String,String> args): Creates a new SimplePatternTokenizerFactory. |
Modifier and Type | Method and Description |
---|---|
SimplePatternTokenizer | create(AttributeFactory factory): Creates a TokenStream of the specified input using the given AttributeFactory. |
Methods inherited from class TokenizerFactory: availableTokenizers, create, forName, lookupClass, reloadTokenizers
Methods inherited from class AbstractAnalysisFactory: get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
public static final String PATTERN
public SimplePatternTokenizer create(AttributeFactory factory)
Specified by: create in class TokenizerFactory
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.