Class PatternTokenizerFactory
- java.lang.Object
-
- org.apache.lucene.analysis.AbstractAnalysisFactory
-
- org.apache.lucene.analysis.TokenizerFactory
-
- org.apache.lucene.analysis.pattern.PatternTokenizerFactory
-
public class PatternTokenizerFactory extends TokenizerFactory
Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".
- "pattern" is the regular expression.
- "group" says which group to extract into tokens.
group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output of String.split(java.lang.String), without empty tokens.
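The "split" behavior described above can be approximated with plain java.util.regex, without any Lucene classes. This is only a sketch of the semantics, not the tokenizer's implementation: the class name SplitModeSketch, the helper splitTokens, and the sample pattern and input are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitModeSketch {
    // Hypothetical helper: mimics group=-1 by splitting the input on the
    // pattern and discarding zero-length pieces.
    public static List<String> splitTokens(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        // Limit -1 keeps trailing empty strings so the filter below is the
        // only place where empty pieces are dropped.
        for (String piece : Pattern.compile(regex).split(input, -1)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on commas with optional surrounding whitespace.
        System.out.println(splitTokens("\\s*,\\s*", "a, b,,c ,")); // prints [a, b, c]
    }
}
```

The empty piece between the two adjacent commas and the empty piece after the final comma are both dropped, which matches the "without empty tokens" qualification.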
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
NOTE: This Tokenizer does not output tokens that are of zero length.
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
  </analyzer>
</fieldType>
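The effect of "group" on the quoted-string example can be reproduced with java.util.regex alone. The following is a minimal sketch of the described semantics, not Lucene's code: the class name GroupModeSketch and the helper groupTokens are hypothetical, and the pattern is written as '([^']+)' because the backslash before ' is redundant in a Java regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupModeSketch {
    // Hypothetical helper: for group >= 0, each match contributes the text of
    // the selected capture group as one token.
    public static List<String> groupTokens(String regex, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            // Zero-length tokens are skipped, as the NOTE above states.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String regex = "'([^']+)'";
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(regex, 0, input)); // prints ['bbb', 'ccc']
        System.out.println(groupTokens(regex, 1, input)); // prints [bbb, ccc]
    }
}
```

Group 0 is the whole match, so the quote marks are kept; group 1 is the first parenthesized sub-expression, so they are not.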
- Since:
- solr1.2
- See Also:
PatternTokenizer
- SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service): "pattern"
Field Summary
Fields
Modifier and Type    Field      Description
protected int        group
static String        GROUP
static String        NAME       SPI name
protected Pattern    pattern
static String        PATTERN
-
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors
Constructor                                         Description
PatternTokenizerFactory()                           Default ctor for compatibility with SPI
PatternTokenizerFactory(Map<String,String> args)    Creates a new PatternTokenizerFactory
-
Method Summary
Modifier and Type    Method                              Description
PatternTokenizer     create(AttributeFactory factory)    Split the input using the configured pattern
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
-
NAME
public static final String NAME
SPI name
- See Also:
- Constant Field Values
-
PATTERN
public static final String PATTERN
- See Also:
- Constant Field Values
-
GROUP
public static final String GROUP
- See Also:
- Constant Field Values
-
pattern
protected final Pattern pattern
-
group
protected final int group
-
-
Method Detail
-
create
public PatternTokenizer create(AttributeFactory factory)
Split the input using the configured pattern.
- Specified by:
create in class TokenizerFactory