Class DefaultICUTokenizerConfig
- java.lang.Object
  - org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
    - org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
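
The baseline behaviour described above, word-boundary segmentation, can be illustrated with the JDK alone. The following is a minimal sketch that uses java.text.BreakIterator as a stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT). It is not the ICU implementation, it applies none of the dictionary tailorings listed above, and its boundaries may differ from ICU's on some input. The class name WordBreakSketch and the method words are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    public static List<String> words(String text) {
        // JDK word iterator: an analogue of ICU's root-locale word instance, not the same rules.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments made up only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world 42"));
    }
}
```

A tokenizer built on such an iterator keeps only segments that hold word content, which is why whitespace and punctuation segments are filtered out here.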
Field Summary
Fields (Modifier and Type, Field: Description)
- static String WORD_EMOJI: Token type for words that appear to be emoji sequences
- static String WORD_HANGUL: Token type for words containing Korean hangul
- static String WORD_HIRAGANA: Token type for words containing Japanese hiragana
- static String WORD_IDEO: Token type for words containing ideographic characters
- static String WORD_KATAKANA: Token type for words containing Japanese katakana
- static String WORD_LETTER: Token type for words that contain letters
- static String WORD_NUMBER: Token type for words that appear to be numbers

Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
- EMOJI_SEQUENCE_STATUS

Constructor Summary
Constructors (Constructor: Description)
- DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config.

Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method: Description)
- boolean combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script): Return a breakiterator capable of processing a given script.
- String getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status.

Field Detail
-
WORD_IDEO
public static final String WORD_IDEO
Token type for words containing ideographic characters
-
WORD_HIRAGANA
public static final String WORD_HIRAGANA
Token type for words containing Japanese hiragana
-
WORD_KATAKANA
public static final String WORD_KATAKANA
Token type for words containing Japanese katakana
-
WORD_HANGUL
public static final String WORD_HANGUL
Token type for words containing Korean hangul
-
WORD_LETTER
public static final String WORD_LETTER
Token type for words that contain letters
-
WORD_NUMBER
public static final String WORD_NUMBER
Token type for words that appear to be numbers
-
WORD_EMOJI
public static final String WORD_EMOJI
Token type for words that appear to be emoji sequences
-
-
Constructor Detail
-
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the first time the class is referenced, breakiterators will be initialized.
- Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation, otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation, otherwise it will be tokenized as syllables.
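
As an illustration of the constructor, the sketch below passes a DefaultICUTokenizerConfig to ICUTokenizer and collects each token together with its type. It assumes the Lucene ICU analysis module and ICU4J are on the classpath. The class name IcuConfigSketch and the method tokenize are hypothetical, and the term/type output format is chosen only for readability.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class IcuConfigSketch {

    /** Tokenizes text and returns "term/type" strings (hypothetical helper). */
    public static List<String> tokenize(String text, boolean cjkAsWords, boolean myanmarAsWords) {
        List<String> out = new ArrayList<>();
        // The config object is lightweight, so creating one per tokenizer is fine.
        DefaultICUTokenizerConfig config =
                new DefaultICUTokenizerConfig(cjkAsWords, myanmarAsWords);
        try (ICUTokenizer tokenizer = new ICUTokenizer(config)) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                out.add(term.toString() + "/" + type.type());
            }
            tokenizer.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Dictionary-based segmentation for both CJK and Myanmar text.
        tokenize("Hello world", true, true).forEach(System.out::println);
    }
}
```

Running the same input with cjkAsWords set to true and then to false is a quick way to compare dictionary-based segmentation with the UAX#29 defaults on CJK text.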
-
-
Method Detail
-
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- Specified by: combineCJ in class ICUTokenizerConfig
-
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a breakiterator capable of processing a given script.
- Specified by: getBreakIterator in class ICUTokenizerConfig
-
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
- Specified by: getType in class ICUTokenizerConfig