org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.util.AbstractAnalysisFactory
org.apache.lucene.analysis.util.TokenizerFactory
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
- All Implemented Interfaces:
- ResourceLoaderAware
public class ICUTokenizerFactory
- extends TokenizerFactory
- implements ResourceLoaderAware
Factory for ICUTokenizer.
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files,
which are compiled by the ICU RuleBasedBreakIterator. See the
ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument containing a
comma-separated list of code:rulefile pairs. Each pair consists of a
four-letter ISO 15924 script code, followed by a colon, then a resource
path. For example, to specify rules for Latin (script code "Latn") and
Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
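The "rulefiles" format described above can be illustrated with a small stand-alone sketch. It is not the factory's actual parsing code: the class RuleFilesArgument and its parse method are hypothetical names introduced here, and the strict check that the colon follows exactly four characters is an assumption based on the "four-letter ISO 15924 script code" wording.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: it only illustrates the
// "code:rulefile,code:rulefile" format of the rulefiles argument.
final class RuleFilesArgument {
    private RuleFilesArgument() {}

    // Splits e.g. "Latn:a.rbbi,Cyrl:b.rbbi" into an ordered map
    // from script code to resource path.
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            int colon = trimmed.indexOf(':');
            // Assumption: a script code is exactly four characters long.
            if (colon != 4) {
                throw new IllegalArgumentException(
                    "Expected <4-letter script code>:<resource path>, got: " + trimmed);
            }
            String script = trimmed.substring(0, colon);
            String path = trimmed.substring(colon + 1).trim();
            if (path.isEmpty()) {
                throw new IllegalArgumentException("Missing resource path for script " + script);
            }
            byScript.put(script, path);
        }
        return byScript;
    }
}
```

Calling RuleFilesArgument.parse with the attribute value from the example above yields two entries, "Latn" and "Cyrl", each mapped to its rule file path. In the real factory the paths are resolved through the ResourceLoader passed to inform.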
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
assureMatchVersion, getArgs, getBoolean, getBoolean, getInt, getInt, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSnowballWordSet, getWordSet, setLuceneMatchVersion, splitFileNames
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ICUTokenizerFactory
public ICUTokenizerFactory()
- Sole constructor. See AbstractAnalysisFactory for initialization lifecycle.
init
public void init(Map<String,String> args)
- Overrides: init in class AbstractAnalysisFactory
inform
public void inform(ResourceLoader loader) throws IOException
- Specified by: inform in interface ResourceLoaderAware
- Throws: IOException
create
public Tokenizer create(Reader input)
- Specified by: create in class TokenizerFactory
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.