Class ICUTokenizerConfig
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
- Direct Known Subclasses:
DefaultICUTokenizerConfig
Class that allows for tailored Unicode Text Segmentation on a per-writing system basis.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Field Summary
Modifier and Type | Field | Description
static final int | EMOJI_SEQUENCE_STATUS | Rule status for emoji sequences
-
Constructor Summary
Constructor | Description
ICUTokenizerConfig() | Sole constructor.
-
Method Summary
Modifier and Type | Method | Description
abstract boolean | combineCJ() | true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
abstract com.ibm.icu.text.RuleBasedBreakIterator | getBreakIterator(int script) | Return a break iterator capable of processing a given script.
abstract String | getType(int script, int ruleStatus) | Return a token type value for a given script and BreakIterator rule status.
-
Field Details
-
EMOJI_SEQUENCE_STATUS
public static final int EMOJI_SEQUENCE_STATUS
Rule status for emoji sequences
-
Constructor Details
-
ICUTokenizerConfig
public ICUTokenizerConfig()
Sole constructor. (For invocation by subclass constructors, typically implicit.)
-
-
Method Details
-
getBreakIterator
public abstract com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Return a break iterator capable of processing a given script.
-
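The idea of choosing a break iterator per script can be sketched with JDK classes alone. This is not the Lucene or ICU API: the class PerScriptBreakSketch, its script constants, and the words helper are hypothetical names for illustration, and java.text.BreakIterator stands in for com.ibm.icu.text.RuleBasedBreakIterator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in, NOT the Lucene API: it selects a java.text.BreakIterator
// per script where a real config would return an ICU RuleBasedBreakIterator.
class PerScriptBreakSketch {
    // Placeholder script codes invented for this sketch (not ICU's values).
    static final int SCRIPT_LATIN = 0;
    static final int SCRIPT_THAI = 1;

    // Mirrors the shape of getBreakIterator(int script): one iterator per script.
    static BreakIterator getBreakIterator(int script) {
        switch (script) {
            case SCRIPT_THAI:
                return BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
            default:
                return BreakIterator.getWordInstance(Locale.ROOT);
        }
    }

    // Splits text with the iterator chosen for the script, keeping only
    // segments that contain at least one letter or digit.
    static List<String> words(int script, String text) {
        BreakIterator it = getBreakIterator(script);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words(SCRIPT_LATIN, "hello world"));
    }
}
```

A real subclass would return a RuleBasedBreakIterator for the given script instead, possibly built from tailored rules.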
getType
public abstract String getType(int script, int ruleStatus)
Return a token type value for a given script and BreakIterator rule status.
-
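One way to picture the mapping from a script and a rule status to a token type is the minimal sketch below. Everything in it is an assumption made for illustration: the class TokenTypeSketch, the constant values, and the type labels are placeholders, not values taken from Lucene or ICU.

```java
// Hypothetical stand-in, not the Lucene or ICU API.
class TokenTypeSketch {
    // Placeholder script codes invented for this sketch.
    static final int SCRIPT_LATIN = 0;
    static final int SCRIPT_HAN = 1;

    // Placeholder rule-status values invented for this sketch.
    static final int STATUS_LETTER = 1;
    static final int STATUS_NUMBER = 2;
    static final int STATUS_EMOJI = 3;

    // Mirrors the shape of getType(int script, int ruleStatus):
    // the rule status is checked first, then the script.
    static String getType(int script, int ruleStatus) {
        if (ruleStatus == STATUS_EMOJI) {
            return "<EMOJI>";        // illustrative label
        }
        if (ruleStatus == STATUS_NUMBER) {
            return "<NUM>";          // illustrative label
        }
        if (script == SCRIPT_HAN) {
            return "<IDEOGRAPHIC>";  // illustrative label
        }
        return "<ALPHANUM>";         // illustrative default
    }

    public static void main(String[] args) {
        System.out.println(getType(SCRIPT_HAN, STATUS_LETTER));
        System.out.println(getType(SCRIPT_LATIN, STATUS_NUMBER));
    }
}
```

Checking the rule status before the script is a choice made in this sketch so that an emoji or number status takes precedence over the script's default type; an actual implementation may order these checks differently.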
combineCJ
public abstract boolean combineCJ()
Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
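As a rough illustration of what such a flag might control, the sketch below collapses three script codes into a single Japanese code when the flag is set. The class CombineCJSketch, the effectiveScript helper, and all constant values are hypothetical; how the tokenizer actually applies combineCJ() is not described on this page.

```java
// Hypothetical stand-in, not the Lucene or ICU API.
class CombineCJSketch {
    // Placeholder script codes invented for this sketch (not ICU's values).
    static final int SCRIPT_LATIN = 0;
    static final int SCRIPT_HAN = 1;
    static final int SCRIPT_HIRAGANA = 2;
    static final int SCRIPT_KATAKANA = 3;
    static final int SCRIPT_JAPANESE = 4;

    // When combineCJ is true, Han, Hiragana, and Katakana are all reported
    // as Japanese; every other script is passed through unchanged.
    static int effectiveScript(int script, boolean combineCJ) {
        boolean isCJ = script == SCRIPT_HAN
                || script == SCRIPT_HIRAGANA
                || script == SCRIPT_KATAKANA;
        return (combineCJ && isCJ) ? SCRIPT_JAPANESE : script;
    }

    public static void main(String[] args) {
        System.out.println(effectiveScript(SCRIPT_HIRAGANA, true) == SCRIPT_JAPANESE);
        System.out.println(effectiveScript(SCRIPT_HIRAGANA, false) == SCRIPT_HIRAGANA);
    }
}
```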
-