Class ICUTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.AbstractAnalysisFactory
org.apache.lucene.analysis.TokenizerFactory
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
- All Implemented Interfaces:
ResourceLoaderAware
Factory for
ICUTokenizer
. Words are broken across script boundaries, then segmented
according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig
.
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/> </analyzer> </fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated
list of code:rulefile
pairs in the following format: four-letter ISO 15924 script code,
followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn")
and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true" rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> </analyzer> </fieldType>
- Since:
- 3.1
- SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service).
- "icu"
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
Constructor Summary
ConstructorDescriptionDefault ctor for compatibility with SPIICUTokenizerFactory
(Map<String, String> args) Creates a new ICUTokenizerFactory -
Method Summary
Modifier and TypeMethodDescriptioncreate
(AttributeFactory factory) void
inform
(ResourceLoader loader) Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
Field Details
-
NAME
SPI name- See Also:
-
-
Constructor Details
-
ICUTokenizerFactory
Creates a new ICUTokenizerFactory -
ICUTokenizerFactory
public ICUTokenizerFactory()Default ctor for compatibility with SPI
-
-
Method Details
-
inform
- Specified by:
inform
in interfaceResourceLoaderAware
- Throws:
IOException
-
create
- Specified by:
create
in classTokenizerFactory
-