Class ICUTokenizerConfig
java.lang.Object
    org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig

Direct Known Subclasses:
DefaultICUTokenizerConfig
public abstract class ICUTokenizerConfig extends Object
Class that allows Unicode text segmentation to be tailored on a per-writing-system basis.
WARNING: This API is experimental and might change in incompatible ways in the next release.
Field Summary
Fields
Modifier and Type    Field                    Description
static int           EMOJI_SEQUENCE_STATUS    Rule status for emoji sequences
Constructor Summary
Constructors
Constructor            Description
ICUTokenizerConfig()   Sole constructor.
Method Summary
Methods
Modifier and Type                                  Method                                Description
abstract boolean                                   combineCJ()                           true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
abstract com.ibm.icu.text.RuleBasedBreakIterator   getBreakIterator(int script)          Return a break iterator capable of processing a given script.
abstract String                                    getType(int script, int ruleStatus)   Return a token type value for a given script and BreakIterator rule status.
Field Detail
EMOJI_SEQUENCE_STATUS
public static final int EMOJI_SEQUENCE_STATUS
Rule status for emoji sequences.
See Also:
Constant Field Values
Method Detail
getBreakIterator
public abstract com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Returns a break iterator capable of processing the given script.
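The idea of handing out one break iterator per script can be sketched with JDK classes alone. The following is only an analogy, not Lucene's implementation: it assumes `java.text.BreakIterator` as a stand-in for `com.ibm.icu.text.RuleBasedBreakIterator` and `Character.UnicodeScript` as a stand-in for the `int` script code, and the class name `PerScriptBreakIterators` and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in: per-script break iterator lookup using only the JDK.
class PerScriptBreakIterators {
    // Scripts that get a tailored iterator; everything else uses the fallback.
    private final Map<Character.UnicodeScript, BreakIterator> byScript =
            new EnumMap<>(Character.UnicodeScript.class);
    private final BreakIterator fallback = BreakIterator.getWordInstance(Locale.ROOT);

    PerScriptBreakIterators() {
        // Assumption for illustration: Thai text gets a Thai-locale word iterator.
        byScript.put(Character.UnicodeScript.THAI,
                BreakIterator.getWordInstance(Locale.forLanguageTag("th")));
    }

    // Rough analogue of getBreakIterator(int script).
    BreakIterator forScript(Character.UnicodeScript script) {
        BreakIterator prototype = byScript.getOrDefault(script, fallback);
        // Break iterators carry mutable state, so hand out a copy.
        return (BreakIterator) prototype.clone();
    }

    // Collects every boundary offset the iterator reports for the text.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        PerScriptBreakIterators config = new PerScriptBreakIterators();
        BreakIterator latin = config.forScript(Character.UnicodeScript.LATIN);
        System.out.println(boundaries(latin, "Hello world"));
    }
}
```

Returning a clone rather than the shared instance is a design choice of this sketch: it keeps callers from interfering with each other's iteration state.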
getType
public abstract String getType(int script, int ruleStatus)
Returns the token type value for the given script and BreakIterator rule status.
combineCJ
public abstract boolean combineCJ()
Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
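The effect of this flag can be illustrated with a small JDK-only sketch. It is not the Lucene code path: `Character.UnicodeScript` stands in for the `int` script codes used by this class, and the class `CombineCjSketch`, the method `scriptLabel`, and the label `"JAPANESE"` are hypothetical names chosen for the example.

```java
// Hypothetical illustration of the combineCJ idea using only the JDK.
class CombineCjSketch {
    // Returns a script label for a code point. When combineCJ is true,
    // Han, Hiragana, and Katakana are all reported under one label.
    static String scriptLabel(int codePoint, boolean combineCJ) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        if (combineCJ) {
            switch (script) {
                case HAN:
                case HIRAGANA:
                case KATAKANA:
                    return "JAPANESE";
                default:
                    break;
            }
        }
        return script.name();
    }

    public static void main(String[] args) {
        int han = 0x6F22;      // a Han ideograph
        int hiragana = 0x3072; // a Hiragana letter
        int katakana = 0x30AB; // a Katakana letter
        for (int cp : new int[] {han, hiragana, katakana, 'a'}) {
            System.out.println(Integer.toHexString(cp)
                    + " separate=" + scriptLabel(cp, false)
                    + " combined=" + scriptLabel(cp, true));
        }
    }
}
```

With the flag off, the three scripts keep their own labels, so mixed Japanese text would be split wherever the script changes. With the flag on, it is treated as a single run.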