Class DefaultICUTokenizerConfig
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
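To give a feel for what UAX#29-style word segmentation produces, here is a minimal sketch that uses the JDK's java.text.BreakIterator as a stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT). This is an assumption made so the example runs without extra dependencies: the JDK iterator is not the ICU one, and it applies none of the dictionary-based tailorings listed above. The class name WordBreakSketch and the helper words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /** Splits text at word boundaries and keeps segments that contain a letter or digit. */
    static List<String> words(String text) {
        // JDK stand-in (assumption): the ICU equivalent is
        // com.ibm.icu.text.BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; skip those segments.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String word : words("Hello, world 42")) {
            System.out.println(word);
        }
    }
}
```

The filtering step is roughly what a tokenizer does with the rule status of each boundary: segments that are only spaces or punctuation are dropped, and the remaining ones become tokens.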
Field Summary
- static final String WORD_EMOJI: Token type for words that appear to be emoji sequences
- static final String WORD_HANGUL: Token type for words containing Korean hangul
- static final String WORD_HIRAGANA: Token type for words containing Japanese hiragana
- static final String WORD_IDEO: Token type for words containing ideographic characters
- static final String WORD_KATAKANA: Token type for words containing Japanese katakana
- static final String WORD_LETTER: Token type for words that contain letters
- static final String WORD_NUMBER: Token type for words that appear to be numbers
Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS
Constructor Summary
- DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config.
Method Summary
- boolean combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script): Return a breakiterator capable of processing a given script.
- String getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status.
Field Details
WORD_IDEO
Token type for words containing ideographic characters
WORD_HIRAGANA
Token type for words containing Japanese hiragana
WORD_KATAKANA
Token type for words containing Japanese katakana
WORD_HANGUL
Token type for words containing Korean hangul
WORD_LETTER
Token type for words that contain letters
WORD_NUMBER
Token type for words that appear to be numbers
WORD_EMOJI
Token type for words that appear to be emoji sequences
Constructor Details
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the first time the class is referenced, breakiterators will be initialized.
Parameters:
- cjkAsWords: true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
- myanmarAsWords: true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
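As a usage sketch, a config built with this constructor can be passed to ICUTokenizer, and the resulting token types read back through TypeAttribute. The example assumes the lucene-analysis-icu module and ICU4J are on the classpath; the class name IcuConfigDemo and the helper tokenize are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class IcuConfigDemo {

    /** Tokenizes text with the given config flags and returns "term/type" strings. */
    static List<String> tokenize(String text, boolean cjkAsWords, boolean myanmarAsWords)
            throws IOException {
        List<String> out = new ArrayList<>();
        DefaultICUTokenizerConfig config =
                new DefaultICUTokenizerConfig(cjkAsWords, myanmarAsWords);
        try (ICUTokenizer tokenizer = new ICUTokenizer(config)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();                    // required before the first incrementToken()
            while (tokenizer.incrementToken()) {
                out.add(term.toString() + "/" + type.type());
            }
            tokenizer.end();                      // finalize offsets; close() follows via try-with-resources
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Both flags true: dictionary-based segmentation for CJK and Myanmar text.
        for (String token : tokenize("hello world", true, true)) {
            System.out.println(token);
        }
    }
}
```

Passing false for cjkAsWords or myanmarAsWords switches the corresponding scripts to the non-dictionary behaviour described in the parameter list above, so comparing the output for the same input under both settings is a quick way to see the difference.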
Method Details
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by: combineCJ in class ICUTokenizerConfig
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a breakiterator capable of processing a given script.
Specified by: getBreakIterator in class ICUTokenizerConfig
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by: getType in class ICUTokenizerConfig