Class DefaultICUTokenizerConfig

java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

  • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
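
The UAX#29-style word breaking described above can be illustrated with a minimal sketch. It assumes the JDK's own java.text.BreakIterator as a stand-in for the ICU BreakIterator named in the description: the JDK rules are not identical to ICU's and perform none of the dictionary tailorings listed above. The class name WordBreakSketch and the method words are hypothetical and are not part of Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    // Splits text at word boundaries and keeps only the pieces that
    // contain at least one letter or digit (drops spaces and punctuation).
    public static List<String> words(String text) {
        // JDK stand-in (assumption): not the ICU iterator Lucene uses.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world 123"));
    }
}
```

The filter on letters and digits loosely mirrors how a tokenizer skips boundary segments that are only whitespace or punctuation; a real tokenizer would rely on the break iterator's rule status instead.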
  • Field Details

    • WORD_IDEO

      public static final String WORD_IDEO
      Token type for words containing ideographic characters
    • WORD_HIRAGANA

      public static final String WORD_HIRAGANA
      Token type for words containing Japanese hiragana
    • WORD_KATAKANA

      public static final String WORD_KATAKANA
      Token type for words containing Japanese katakana
    • WORD_HANGUL

      public static final String WORD_HANGUL
      Token type for words containing Korean hangul
    • WORD_LETTER

      public static final String WORD_LETTER
      Token type for words that contain letters
    • WORD_NUMBER

      public static final String WORD_NUMBER
      Token type for words that appear to be numbers
    • WORD_EMOJI

      public static final String WORD_EMOJI
      Token type for words that appear to be emoji sequences
  • Constructor Details

    • DefaultICUTokenizerConfig

      public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
      Creates a new config. Instances are lightweight, but the break iterators are initialized the first time the class is referenced.
      Parameters:
      cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han, Hiragana, and Katakana words are tagged as IDEOGRAPHIC.
      myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
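
The constructor note says that instances are cheap while shared state is set up once, when the class is first referenced. A minimal plain-Java sketch of that pattern follows. SketchConfig, SHARED_TABLE, buildTable, and buildsAfter are hypothetical names chosen for illustration; this is not the Lucene source, and the int array is only a placeholder for the real break iterator data.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyStaticInitDemo {

    // Counts how many times the shared static state has been built.
    public static final AtomicInteger BUILDS = new AtomicInteger();

    // Hypothetical stand-in for a config class with costly static state.
    static final class SketchConfig {
        // Built once, when the JVM initializes this class on first use.
        private static final int[] SHARED_TABLE = buildTable();

        private final boolean cjkAsWords;

        SketchConfig(boolean cjkAsWords) {
            this.cjkAsWords = cjkAsWords; // cheap: only stores a flag
        }

        boolean cjkAsWords() {
            return cjkAsWords;
        }

        private static int[] buildTable() {
            BUILDS.incrementAndGet();
            return new int[1024]; // placeholder for expensive setup
        }
    }

    // Creates the given number of instances and reports the build count.
    public static int buildsAfter(int instances) {
        for (int i = 0; i < instances; i++) {
            new SketchConfig(i % 2 == 0);
        }
        return BUILDS.get();
    }

    public static void main(String[] args) {
        System.out.println("builds before first use: " + BUILDS.get());
        System.out.println("builds after three instances: " + buildsAfter(3));
    }
}
```

Under this pattern the one-time cost lands on whichever thread first touches the class, so a caller who cares about first-request latency could reference the class during startup.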
  • Method Details

    • combineCJ

      public boolean combineCJ()
      Description copied from class: ICUTokenizerConfig
      Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
      Specified by:
      combineCJ in class ICUTokenizerConfig
    • getBreakIterator

      public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
      Description copied from class: ICUTokenizerConfig
      Returns a break iterator capable of processing the given script.
      Specified by:
      getBreakIterator in class ICUTokenizerConfig
    • getType

      public String getType(int script, int ruleStatus)
      Description copied from class: ICUTokenizerConfig
      Returns the token type value for the given script and BreakIterator rule status.
      Specified by:
      getType in class ICUTokenizerConfig