Class ICUTokenizerFactory
java.lang.Object
  org.apache.lucene.analysis.util.AbstractAnalysisFactory
    org.apache.lucene.analysis.util.TokenizerFactory
      org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
All Implemented Interfaces: ResourceLoaderAware
public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
To use the default set of per-script rules:
  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
    </analyzer>
  </fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
  <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
                 rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
    </analyzer>
  </fieldType>
Since: 3.1
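The "rulefiles" format described above (comma-separated code:rulefile pairs) can be illustrated with a small standalone sketch. This is not the Lucene implementation: the class name RuleFilesSketch and the method parse are hypothetical, and the sketch checks only that each script code has four characters, not that it is a valid ISO 15924 code or that the resource exists.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
class RuleFilesSketch {

    /** Splits "code:path,code:path" into an ordered script-code -> resource-path map. */
    static Map<String, String> parse(String ruleFiles) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : ruleFiles.split(",")) {
            // The first colon separates the script code from the resource path.
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected code:rulefile, got: " + pair);
            }
            String scriptCode = pair.substring(0, colon).trim();
            String resourcePath = pair.substring(colon + 1).trim();
            // Assumption: only the length is checked here, not membership in ISO 15924.
            if (scriptCode.length() != 4) {
                throw new IllegalArgumentException("Not a four-letter script code: " + scriptCode);
            }
            byScript.put(scriptCode, resourcePath);
        }
        return byScript;
    }

    public static void main(String[] args) {
        Map<String, String> parsed =
            parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        parsed.forEach((script, path) -> System.out.println(script + " -> " + path));
    }
}
```

Running main prints one line per script, pairing each four-letter code with the resource path that would be looked up for it.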
Field Summary
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
  Constructor                                    Description
  ICUTokenizerFactory(Map<String,String> args)   Creates a new ICUTokenizerFactory
Method Summary
  Modifier and Type   Method                             Description
  ICUTokenizer        create(AttributeFactory factory)
  void                inform(ResourceLoader loader)
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Method Detail
inform
public void inform(ResourceLoader loader) throws IOException
Specified by: inform in interface ResourceLoaderAware
Throws: IOException
create
public ICUTokenizer create(AttributeFactory factory)
Specified by: create in class TokenizerFactory