Class DefaultICUTokenizerConfig


  • public class DefaultICUTokenizerConfig
    extends ICUTokenizerConfig
    Default ICUTokenizerConfig that is generally applicable to many languages.

    Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

    • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
    WARNING: This API is experimental and might change in incompatible ways in the next release.
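    The untailored, rule-based part of this behavior can be approximated with the JDK alone. The sketch below is only an analogy: it uses java.text.BreakIterator rather than ICU4J's BreakIterator, applies none of the dictionary tailorings listed above, and the class name WordBreakSketch and its words helper are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: this is the JDK's word BreakIterator, not ICU4J's,
// and it does no dictionary-based segmentation.
final class WordBreakSketch {

    // Splits text at word boundaries and keeps only the segments that
    // contain at least one letter or digit (whitespace and punctuation
    // segments are dropped).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

    Scripts written without spaces between words, such as Thai or CJK, are where a purely rule-based iterator falls short, which is why this config switches to a dictionary for them.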
    • Field Detail

      • WORD_IDEO

        public static final String WORD_IDEO
        Token type for words containing ideographic characters
      • WORD_HIRAGANA

        public static final String WORD_HIRAGANA
        Token type for words containing Japanese hiragana
      • WORD_KATAKANA

        public static final String WORD_KATAKANA
        Token type for words containing Japanese katakana
      • WORD_HANGUL

        public static final String WORD_HANGUL
        Token type for words containing Korean hangul
      • WORD_LETTER

        public static final String WORD_LETTER
        Token type for words that contain letters
      • WORD_NUMBER

        public static final String WORD_NUMBER
        Token type for words that appear to be numbers
      • WORD_EMOJI

        public static final String WORD_EMOJI
        Token type for words that appear to be emoji sequences
    • Constructor Detail

      • DefaultICUTokenizerConfig

        public DefaultICUTokenizerConfig(boolean cjkAsWords,
                                         boolean myanmarAsWords)
        Creates a new config. This object is lightweight, but the first time the class is referenced, its break iterators are initialized.
        Parameters:
        cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise, text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
        myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise, it will be tokenized as syllables.
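        The note about initialization on first reference follows the usual Java pattern of keeping expensive shared state in static fields. The sketch below illustrates that pattern in general; it is not this class's source, and the names LazyStaticInitSketch, Config, and load are hypothetical.

```java
// Illustrative only: shows how static state is initialized once, on first
// use of a class, while instances themselves stay cheap to create.
final class LazyStaticInitSketch {

    // Counts how many times the (pretend) expensive load has run.
    static int loads = 0;

    static final class Config {
        // Initialized when Config is first used, not when the outer class loads.
        static final String SHARED = load();

        private static String load() {
            loads++;
            return "expensive shared state";
        }

        Config() {
            // Per-instance work is trivial; the heavy state above is shared.
        }
    }

    public static void main(String[] args) {
        new Config();
        new Config();
        System.out.println("loads = " + loads);
    }
}
```

        Under this pattern, creating many config objects costs little after the first one, but the first reference pays the one-time setup cost.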
    • Method Detail

      • getBreakIterator

        public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
        Description copied from class: ICUTokenizerConfig
        Return a break iterator capable of processing a given script.
        Specified by:
        getBreakIterator in class ICUTokenizerConfig
      • getType

        public String getType(int script,
                              int ruleStatus)
        Description copied from class: ICUTokenizerConfig
        Return a token type value for a given script and BreakIterator rule status.
        Specified by:
        getType in class ICUTokenizerConfig
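        As a rough illustration of how a script and a rule status could be combined into one of the token types listed under Field Detail, here is a hypothetical sketch. It is not this class's implementation: the Script enum stands in for ICU's integer script codes, the returned strings are just the field names (not their actual values), "OTHER" is a made-up fallback, and emoji handling is left out. The numeric ranges mirror ICU's word rule-status ranges (numbers from 100, letters from 200, kana from 300, ideographs from 400, each 100 wide).

```java
// Illustrative only: a hypothetical mapping from (script, rule status)
// to a token-type label.
final class TokenTypeSketch {

    // Hypothetical stand-in for ICU's integer script codes.
    enum Script { HAN, HIRAGANA, KATAKANA, HANGUL, OTHER }

    // Lower bounds of ICU's word rule-status ranges; each range spans 100 values.
    static final int WORD_NUMBER = 100;
    static final int WORD_LETTER = 200;
    static final int WORD_KANA = 300;
    static final int WORD_IDEO = 400;
    static final int WORD_IDEO_LIMIT = 500;

    static String typeFor(Script script, int ruleStatus) {
        if (ruleStatus >= WORD_IDEO && ruleStatus < WORD_IDEO_LIMIT) {
            return "WORD_IDEO";
        }
        if (ruleStatus >= WORD_KANA && ruleStatus < WORD_IDEO) {
            // Kana status alone does not separate the two syllabaries,
            // so the script decides.
            return script == Script.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
        }
        if (ruleStatus >= WORD_LETTER && ruleStatus < WORD_KANA) {
            // Hangul is reported with letter status, so the script decides again.
            return script == Script.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
        }
        if (ruleStatus >= WORD_NUMBER && ruleStatus < WORD_LETTER) {
            return "WORD_NUMBER";
        }
        return "OTHER"; // hypothetical fallback for non-word segments
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Script.HIRAGANA, 300));
        System.out.println(typeFor(Script.HANGUL, 200));
    }
}
```

        The point of taking both arguments is that the rule status alone is too coarse: it cannot tell hiragana from katakana, or hangul from other letters.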