org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory

All Implemented Interfaces: ResourceLoaderAware

public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware

Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.

To use the default set of per-script rules:

 <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"/>
   </analyzer>
 </fieldType>

You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.

To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, a colon, and a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

 <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
                rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
   </analyzer>
 </fieldType>
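
The code:rulefile format described above can be illustrated with a small parser. This is only a sketch of the documented format, not the factory's actual parsing code: the class `RuleFilesArg` and its `parse` method are hypothetical names, and checking the script code with a four-letter pattern is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene) that splits a "rulefiles" value into pairs. */
final class RuleFilesArg {

  /** Returns script code -> resource path, in the order the pairs were listed. */
  static Map<String, String> parse(String rulefiles) {
    Map<String, String> byScript = new LinkedHashMap<>();
    for (String pair : rulefiles.split(",")) {
      // Split on the first colon only, so the resource path is kept whole.
      int colon = pair.indexOf(':');
      if (colon < 0) {
        throw new IllegalArgumentException("Expected code:rulefile, got: " + pair);
      }
      String scriptCode = pair.substring(0, colon).trim();
      String resourcePath = pair.substring(colon + 1).trim();
      // Assumption for this sketch: accept any four ASCII letters as a script code.
      if (!scriptCode.matches("[A-Za-z]{4}")) {
        throw new IllegalArgumentException("Not a four-letter script code: " + scriptCode);
      }
      if (resourcePath.isEmpty()) {
        throw new IllegalArgumentException("Missing resource path for: " + scriptCode);
      }
      byScript.put(scriptCode, resourcePath);
    }
    return byScript;
  }

  public static void main(String[] args) {
    // Same value as the rulefiles attribute in the example above.
    System.out.println(parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"));
  }
}
```

Splitting on the first colon keeps the sketch tolerant of resource paths that themselves contain a colon. Whether the real factory accepts such paths is not stated here.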

Since: 3.1
SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service): "icu"

Field Summary

Fields

Modifier and Type | Field | Description
static final String | NAME | SPI name

Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion

Constructor Summary

Constructors

Constructor | Description
ICUTokenizerFactory() | Default ctor for compatibility with SPI
ICUTokenizerFactory(Map<String,String> args) | Creates a new ICUTokenizerFactory

Method Summary

Modifier and Type | Method | Description
ICUTokenizer | create(AttributeFactory factory) |
void | inform(ResourceLoader loader) |

Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers

Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NAME
  
  public static final String NAME
  
  SPI name
  See Also:
  
  Constant Field Values
Constructor Details
- ICUTokenizerFactory
  
  public ICUTokenizerFactory(Map<String,String> args)
  
  Creates a new ICUTokenizerFactory
- ICUTokenizerFactory
  
  public ICUTokenizerFactory()
  
  Default ctor for compatibility with SPI
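
As a rough illustration of the Map-based constructor, the sketch below builds an argument map that mirrors the attributes of the customized <tokenizer> example (cjkAsWords and rulefiles). The class `IcuFactoryArgs` and the method `customArgs` are hypothetical names. The constructor call appears only in a comment, so the block compiles without the Lucene ICU module. A mutable map is used on the assumption that the factory removes the entries it consumes; that behavior is not stated on this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene) that assembles constructor arguments. */
final class IcuFactoryArgs {

  /** Builds a mutable map with the same keys and values as the XML attributes. */
  static Map<String, String> customArgs() {
    Map<String, String> args = new HashMap<>();
    args.put("cjkAsWords", "true");
    args.put("rulefiles", "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
    return args;
  }

  public static void main(String[] unused) {
    Map<String, String> args = customArgs();
    // With the Lucene ICU analysis module on the classpath, the map would be passed as:
    //   ICUTokenizerFactory factory = new ICUTokenizerFactory(args);
    // followed by factory.inform(loader) so the rule files can be resolved.
    System.out.println(args.get("rulefiles"));
  }
}
```

When rule files are configured, inform(ResourceLoader) would need to be called before create(AttributeFactory), since the rule files are given as resource paths.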
Method Details
- inform
  
  public void inform(ResourceLoader loader) throws IOException
  
  Specified by:
  
  inform in interface ResourceLoaderAware
  
  Throws:
  
  IOException
- create
  
  public ICUTokenizer create(AttributeFactory factory)
  
  Specified by:
  
  create in class TokenizerFactory
