ICUTokenizerFactory (Lucene 4.2.1 API)

org.apache.lucene.analysis.icu.segmentation
Class ICUTokenizerFactory

java.lang.Object
  org.apache.lucene.analysis.util.AbstractAnalysisFactory
      org.apache.lucene.analysis.util.TokenizerFactory
          org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory

All Implemented Interfaces: ResourceLoaderAware

public class ICUTokenizerFactory
extends TokenizerFactory
implements ResourceLoaderAware

Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.

To use the default set of per-script rules:

 <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"/>
   </analyzer>
 </fieldType>

You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference. To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

 <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"
                rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
   </analyzer>
 </fieldType>
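
The "rulefiles" format described above can be illustrated with a small standalone sketch. This is not the factory's own parsing code: the class name RuleFilesArg and its parse method are hypothetical, and the validation rules (four ASCII letters for the script code, a non-empty resource path) are assumptions drawn only from the format description. It uses only the Java standard library and does not load or compile any rule files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Lucene) that splits a "rulefiles" value
 * into script-code / resource-path pairs, following the documented format.
 */
final class RuleFilesArg {

    private RuleFilesArg() {}

    /**
     * Parses a value such as "Latn:a.rbbi,Cyrl:b.rbbi" into an ordered map
     * from script code to resource path. A null or blank value yields an
     * empty map, mirroring the case where no "rulefiles" argument is given.
     */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> result = new LinkedHashMap<>();
        if (rulefiles == null || rulefiles.trim().isEmpty()) {
            return result;
        }
        for (String pair : rulefiles.split(",")) {
            // Split on the first colon only, so the path itself may contain colons.
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException(
                    "Expected code:rulefile but got: " + pair);
            }
            String scriptCode = pair.substring(0, colon).trim();
            String resourcePath = pair.substring(colon + 1).trim();
            // Assumption: a script code is exactly four ASCII letters (e.g. "Latn").
            if (!scriptCode.matches("[A-Za-z]{4}")) {
                throw new IllegalArgumentException(
                    "Not a four-letter script code: " + scriptCode);
            }
            if (resourcePath.isEmpty()) {
                throw new IllegalArgumentException(
                    "Missing resource path for script code: " + scriptCode);
            }
            result.put(scriptCode, resourcePath);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> rules =
            parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        for (Map.Entry<String, String> e : rules.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Running a check like this over a "rulefiles" value before putting it in a schema can catch a missing colon or a misspelled script code early. Whether the script code is actually known to ICU, and whether the rule file compiles, is still only determined when the factory itself loads the resources.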

Field Summary

Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
`args, luceneMatchVersion`

Constructor Summary
`ICUTokenizerFactory()` Sole constructor.

Method Summary
`Tokenizer`	`create(Reader input)`
`void`	`inform(ResourceLoader loader)`
`void`	`init(Map<String,String> args)`

Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
`availableTokenizers, forName, lookupClass, reloadTokenizers`

Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
`assureMatchVersion, getArgs, getBoolean, getBoolean, getInt, getInt, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSnowballWordSet, getWordSet, setLuceneMatchVersion, splitFileNames`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail