org.apache.lucene.analysis.pattern.PatternTokenizerFactory

public class PatternTokenizerFactory extends TokenizerFactory

Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

"pattern" is the regular expression.
"group" says which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens are the same as the output of String.split(java.lang.String), except that empty tokens are omitted.
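The split behaviour described above can be approximated with plain java.util.regex, without any Lucene classes. This is only a sketch of the described semantics, not the tokenizer's actual implementation; the class name SplitModeSketch, the method splitTokens, and the sample pattern and input are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): imitates group=-1 by
// splitting on the pattern and discarding zero-length pieces.
class SplitModeSketch {
    static List<String> splitTokens(String regex, String input) {
        List<String> tokens = new ArrayList<>();
        // A limit of -1 keeps trailing empty strings so that the
        // filter below is the only place where empties are dropped.
        for (String piece : Pattern.compile(regex).split(input, -1)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on commas with optional surrounding whitespace.
        System.out.println(splitTokens("\\s*,\\s*", "a, b,,c ,"));
    }
}
```

With the sample input, the empty piece between the two adjacent commas and the empty piece after the final comma are both discarded, leaving a, b and c.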

Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = \'([^\']+)\'
  group = 0
  input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output will be bbb and ccc (no ' marks).
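The group example above can be reproduced with java.util.regex alone. The sketch below is an approximation of the described behaviour under the assumption that each successive match contributes one token; GroupModeSketch and groupTokens are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): imitates group >= 0 by
// emitting the selected capture group of every match as a token.
class GroupModeSketch {
    static List<String> groupTokens(String regex, int group, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            String token = matcher.group(group);
            // Skip groups that did not participate in the match and
            // zero-length results.
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String regex = "'([^']+)'";
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(groupTokens(regex, 0, input)); // whole match, quotes kept
        System.out.println(groupTokens(regex, 1, input)); // first group, quotes dropped
    }
}
```

Group 0 is the whole match, so the quotes are kept; group 1 is the first parenthesised sub-expression, so only the text between the quotes is emitted.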

NOTE: This Tokenizer does not output tokens that are of zero length.

 <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
   </analyzer>
 </fieldType>

Since:

solr1.2

See Also:

PatternTokenizer

SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service): "pattern"

Field Summary

Fields

Modifier and Type          Field      Description
protected final int        group
static final String        GROUP
static final String        NAME       SPI name
protected final Pattern    pattern
static final String        PATTERN

Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion

Constructor Summary

Constructors

Constructor                                         Description
PatternTokenizerFactory()                           Default ctor for compatibility with SPI
PatternTokenizerFactory(Map<String,String> args)    Creates a new PatternTokenizerFactory

Method Summary

Modifier and Type    Method                              Description
PatternTokenizer     create(AttributeFactory factory)    Splits the input using the configured pattern

Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers

Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NAME
  
  public static final String NAME
  
  SPI name
  See Also:
  
  Constant Field Values
- PATTERN
  
  public static final String PATTERN
  See Also:
  
  Constant Field Values
- GROUP
  
  public static final String GROUP
  See Also:
  
  Constant Field Values
- pattern
  
  protected final Pattern pattern
- group
  
  protected final int group
Constructor Details
- PatternTokenizerFactory
  
  public PatternTokenizerFactory(Map<String,String> args)
  
  Creates a new PatternTokenizerFactory
- PatternTokenizerFactory
  
  public PatternTokenizerFactory()
  
  Default ctor for compatibility with SPI
Method Details
- create
  
  public PatternTokenizer create(AttributeFactory factory)
  
  Splits the input using the configured pattern
  
  Specified by:
  
  create in class TokenizerFactory
