org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory

public class SimplePatternSplitTokenizerFactory extends TokenizerFactory

Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.

This tokenizer uses Lucene RegExp pattern matching to split the input stream into distinct tokens. The syntax is more limited than that of PatternTokenizer, but tokenization is considerably faster. The factory takes two arguments:

"pattern" (required): the regular expression, following the syntax described at RegExp
"determinizeWorkLimit" (optional, default Operations.DEFAULT_DETERMINIZE_WORK_LIMIT): the limit on the total effort spent determinizing the automaton computed from the regexp

The pattern matches the characters that separate tokens, as in String.split. Matching is greedy: at any given position, the longest matching token separator is used. Empty tokens are never created.

For example, to produce tokens delimited by simple whitespace characters:

 <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
   </analyzer>
 </fieldType>
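
The splitting behaviour described above (separators are matched by the pattern, the longest separator run is consumed, and empty tokens are dropped) can be illustrated with a small sketch. This is only an analogy built on java.util.regex, not on Lucene's RegExp or automaton classes; the class SplitSemanticsSketch and the helper splitTokens are hypothetical names introduced for illustration. Note also that java.util.regex resolves alternations leftmost-first, so it mirrors the longest-separator rule only for simple patterns such as the whitespace class used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSemanticsSketch {

    // Hypothetical helper: imitates the documented split semantics with
    // java.util.regex. It does NOT use Lucene's RegExp syntax or classes.
    static List<String> splitTokens(String separatorRegex, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorRegex).matcher(input);
        int start = 0; // start offset of the token currently being collected
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // an empty separator match does not split anything
            }
            if (m.start() > start) {
                // Only emit non-empty tokens, as the tokenizer never creates empty ones.
                tokens.add(input.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start < input.length()) {
            tokens.add(input.substring(start)); // trailing token, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Leading, trailing, and repeated whitespace yields no empty tokens.
        System.out.println(splitTokens("[ \t\r\n]+", "  quick \t brown\nfox  "));
    }
}
```

Unlike String.split, which can return a leading empty string when the input starts with a separator, this sketch skips every empty token, which is the behaviour the factory documentation describes.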

Since:

6.5.0

See Also:

SimplePatternSplitTokenizer

WARNING: This API is experimental and might change in incompatible ways in the next release.

SPI Name (case-insensitive: 'simplepatternsplit' can also be used when looking up the service).

"simplePatternSplit"

Field Summary

Fields

Modifier and Type

Field

Description

static final String

NAME

SPI name

static final String

PATTERN

Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion

Constructor Summary

Constructors

Constructor

Description

SimplePatternSplitTokenizerFactory()

Default constructor for compatibility with SPI

SimplePatternSplitTokenizerFactory(Map<String,String> args)

Creates a new SimplePatternSplitTokenizerFactory

Method Summary

Modifier and Type

Method

Description

SimplePatternSplitTokenizer

create(AttributeFactory factory)

Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers

Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NAME
  
  public static final String NAME
  
  SPI name
  See Also:
  
  Constant Field Values
- PATTERN
  
  public static final String PATTERN
  See Also:
  
  Constant Field Values
Constructor Details
- SimplePatternSplitTokenizerFactory
  
  public SimplePatternSplitTokenizerFactory(Map<String,String> args)
  
  Creates a new SimplePatternSplitTokenizerFactory
- SimplePatternSplitTokenizerFactory
  
  public SimplePatternSplitTokenizerFactory()
  
  Default constructor for compatibility with SPI
Method Details
- create
  
  public SimplePatternSplitTokenizer create(AttributeFactory factory)
  
  Specified by:
  
  create in class TokenizerFactory
