org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizerFactory

java.lang.Object
  extended by org.apache.lucene.analysis.util.AbstractAnalysisFactory
      extended by org.apache.lucene.analysis.util.TokenizerFactory
          extended by org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
All Implemented Interfaces:
ResourceLoaderAware

public class ICUTokenizerFactory
extends TokenizerFactory
implements ResourceLoaderAware

Factory for ICUTokenizer. Words are first broken across script boundaries, then segmented within each script run according to the BreakIterator and token typing provided by the DefaultICUTokenizerConfig.
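
To make "broken across script boundaries" concrete, here is a minimal sketch in plain Java that splits text into runs of a single script. It is an illustration only and not the ICU or Lucene implementation: it uses the JDK's Character.UnicodeScript (Java 7+) rather than ICU's script data, and the class and method names (ScriptRunSketch, splitByScript) are hypothetical. Characters in the COMMON or INHERITED scripts (spaces, punctuation, combining marks) are assumed to stay with the run they follow.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or ICU.
final class ScriptRunSketch {

    /** Splits text into substrings, starting a new one whenever the script changes. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript current = null; // script of the run being built; null until one is seen
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: script-neutral characters never start a new run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // close the previous run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp); // advance by whole code points
        }
        if (start < text.length()) {
            runs.add(text.substring(start)); // close the final run
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin text followed by Cyrillic text, written with Unicode escapes.
        for (String run : splitByScript("abc \u043c\u0438\u0440")) {
            System.out.println("[" + run + "]");
        }
    }
}
```

In the real tokenizer, each such run is then handed to the break iterator configured for its script, which is where the per-script rules described next come in.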

To use the default set of per-script rules:

 <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"/>
   </analyzer>
 </fieldType>

You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference. To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

 <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"
                rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
   </analyzer>
 </fieldType>
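
The "rulefiles" value format described above can be sketched in plain Java as follows. This is an illustration of the documented format only, not the factory's actual parsing code: the class and method names (RuleFilesSketch, parse) are hypothetical, and the strictness of the validation (exactly four letters, non-empty path, surrounding whitespace trimmed) is an assumption that the real factory may not share.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
final class RuleFilesSketch {

    /**
     * Splits a value such as "Latn:a.rbbi,Cyrl:b.rbbi" into an ordered map
     * from ISO 15924 script code to resource path.
     */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            // The first colon must come right after a four-character code.
            int colon = trimmed.indexOf(':');
            if (colon != 4) {
                throw new IllegalArgumentException(
                        "Expected <4-letter script code>:<resource path>, got: " + trimmed);
            }
            String code = trimmed.substring(0, 4);
            for (int i = 0; i < code.length(); i++) {
                if (!Character.isLetter(code.charAt(i))) {
                    throw new IllegalArgumentException("Not a script code: " + code);
                }
            }
            String path = trimmed.substring(5).trim();
            if (path.isEmpty()) {
                throw new IllegalArgumentException("Missing resource path for: " + code);
            }
            result.put(code, path);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> rules =
                parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        for (Map.Entry<String, String> e : rules.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Only the first colon in each pair is treated as the separator, so a resource path that itself contains a colon would survive intact under this sketch.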


Field Summary
 
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
args, luceneMatchVersion
 
Constructor Summary
ICUTokenizerFactory()
          Sole constructor.
 
Method Summary
 Tokenizer create(Reader input)
           
 void inform(ResourceLoader loader)
           
 void init(Map<String,String> args)
           
 
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, forName, lookupClass, reloadTokenizers
 
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
assureMatchVersion, getArgs, getBoolean, getBoolean, getInt, getInt, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSnowballWordSet, getWordSet, setLuceneMatchVersion, splitFileNames
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ICUTokenizerFactory

public ICUTokenizerFactory()
Sole constructor. See AbstractAnalysisFactory for initialization lifecycle.

Method Detail

init

public void init(Map<String,String> args)
Overrides:
init in class AbstractAnalysisFactory

inform

public void inform(ResourceLoader loader)
            throws IOException
Specified by:
inform in interface ResourceLoaderAware
Throws:
IOException

create

public Tokenizer create(Reader input)
Specified by:
create in class TokenizerFactory


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.