Class ICUTokenizerConfig

  • Direct Known Subclasses:
    DefaultICUTokenizerConfig

    public abstract class ICUTokenizerConfig
    extends Object
    Class that allows for tailored Unicode Text Segmentation on a per-writing-system basis.
    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Field Detail

      • EMOJI_SEQUENCE_STATUS

        public static final int EMOJI_SEQUENCE_STATUS
        Rule status for emoji sequences
        See Also:
        Constant Field Values
    • Constructor Detail

      • ICUTokenizerConfig

        public ICUTokenizerConfig()
        Sole constructor. (For invocation by subclass constructors, typically implicit.)
    • Method Detail

      • getBreakIterator

        public abstract com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
        Return a BreakIterator capable of processing a given script.
      • getType

        public abstract String getType(int script,
                                       int ruleStatus)
        Return a token type value for a given script and BreakIterator rule status.
      • combineCJ

        public abstract boolean combineCJ()
        Return true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
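
The interplay between getType and combineCJ can be sketched as follows. This is a minimal illustration, not the library's implementation: to compile with the JDK alone it uses a hypothetical stand-in class (ConfigStandIn) that mirrors only two of the abstract methods and omits getBreakIterator. The DEMO_ script codes, the DEMO_STATUS_ values, the token type strings, and the effectiveScript helper are all assumptions made for this example. A real subclass would extend ICUTokenizerConfig, use the ICU script constants, and also implement getBreakIterator.

```java
public class TokenizerConfigSketch {

    // Hypothetical script codes, chosen arbitrarily for this sketch.
    // Real code would use the ICU script constants instead.
    static final int DEMO_HAN = 1;
    static final int DEMO_HIRAGANA = 2;
    static final int DEMO_KATAKANA = 3;
    static final int DEMO_JAPANESE = 4;
    static final int DEMO_LATIN = 5;

    // Hypothetical rule status values, also arbitrary.
    static final int DEMO_STATUS_NUMBER = 100;
    static final int DEMO_STATUS_LETTER = 200;
    static final int DEMO_STATUS_EMOJI = 299;

    // Stand-in that mirrors two of the three abstract methods.
    abstract static class ConfigStandIn {
        public abstract String getType(int script, int ruleStatus);
        public abstract boolean combineCJ();
    }

    // Example configuration with illustrative token type strings.
    static class DemoConfig extends ConfigStandIn {
        private final boolean cjAsJapanese;

        DemoConfig(boolean cjAsJapanese) {
            this.cjAsJapanese = cjAsJapanese;
        }

        @Override
        public String getType(int script, int ruleStatus) {
            // Rule status takes precedence over script in this sketch.
            if (ruleStatus == DEMO_STATUS_EMOJI) {
                return "<EMOJI>";
            }
            if (ruleStatus == DEMO_STATUS_NUMBER) {
                return "<NUM>";
            }
            if (script == DEMO_JAPANESE) {
                return "<JAPANESE>";
            }
            return "<ALPHANUM>";
        }

        @Override
        public boolean combineCJ() {
            return cjAsJapanese;
        }
    }

    // Hypothetical helper: how a caller might fold Han, Hiragana, and
    // Katakana into one Japanese script code when combineCJ() is true.
    static int effectiveScript(ConfigStandIn config, int script) {
        boolean isCJ = script == DEMO_HAN
                || script == DEMO_HIRAGANA
                || script == DEMO_KATAKANA;
        return (isCJ && config.combineCJ()) ? DEMO_JAPANESE : script;
    }

    public static void main(String[] args) {
        ConfigStandIn config = new DemoConfig(true);
        int script = effectiveScript(config, DEMO_KATAKANA);
        System.out.println(config.getType(script, DEMO_STATUS_LETTER));
    }
}
```

With combineCJ returning false, the same helper leaves each script code unchanged, so Han, Hiragana, and Katakana text would be typed separately.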