Class ICUTokenizerFactory
- java.lang.Object
-
- org.apache.lucene.analysis.AbstractAnalysisFactory
-
- org.apache.lucene.analysis.TokenizerFactory
-
- org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
-
- All Implemented Interfaces:
ResourceLoaderAware
public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory forICUTokenizer
. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by theDefaultICUTokenizerConfig
.To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/> </analyzer> </fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of
code:rulefile
pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true" rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> </analyzer> </fieldType>
- Since:
- 3.1
- SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service).
- "icu"
-
-
Field Summary
Fields Modifier and Type Field Description static String
NAME
SPI name-
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors Constructor Description ICUTokenizerFactory()
Default ctor for compatibility with SPIICUTokenizerFactory(Map<String,String> args)
Creates a new ICUTokenizerFactory
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description ICUTokenizer
create(AttributeFactory factory)
void
inform(ResourceLoader loader)
-
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
-
-
-
Field Detail
-
NAME
public static final String NAME
SPI name- See Also:
- Constant Field Values
-
-
Method Detail
-
inform
public void inform(ResourceLoader loader) throws IOException
- Specified by:
inform
in interfaceResourceLoaderAware
- Throws:
IOException
-
create
public ICUTokenizer create(AttributeFactory factory)
- Specified by:
create
in classTokenizerFactory
-
-