Class ICUTokenizerFactory
- java.lang.Object
-
- org.apache.lucene.analysis.util.AbstractAnalysisFactory
-
- org.apache.lucene.analysis.util.TokenizerFactory
-
- org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
-
- All Implemented Interfaces:
ResourceLoaderAware
public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory forICUTokenizer
. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by theDefaultICUTokenizerConfig
.To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/> </analyzer> </fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true" rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> </analyzer> </fieldType>
- Since:
- 3.1
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
-
Constructor Summary
Constructors Constructor Description ICUTokenizerFactory(Map<String,String> args)
Creates a new ICUTokenizerFactory
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description ICUTokenizer
create(AttributeFactory factory)
void
inform(ResourceLoader loader)
-
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, forName, lookupClass, reloadTokenizers
-
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
-
-
-
Method Detail
-
inform
public void inform(ResourceLoader loader) throws IOException
- Specified by:
inform
in interfaceResourceLoaderAware
- Throws:
IOException
-
create
public ICUTokenizer create(AttributeFactory factory)
- Specified by:
create
in classTokenizerFactory
-
-