Class ICUTokenizerConfig

java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
Direct Known Subclasses:
DefaultICUTokenizerConfig

public abstract class ICUTokenizerConfig extends Object
Class that allows for tailored Unicode Text Segmentation on a per-writing-system basis.
WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    EMOJI_SEQUENCE_STATUS
    Rule status for emoji sequences
  • Constructor Summary

    Constructors
    Constructor
    Description
    ICUTokenizerConfig()
    Sole constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    abstract boolean
    combineCJ()
    Returns true if Han, Hiragana, and Katakana scripts should all be reported as Japanese.
    abstract com.ibm.icu.text.RuleBasedBreakIterator
    getBreakIterator(int script)
    Returns a break iterator capable of processing the given script.
    abstract String
    getType(int script, int ruleStatus)
    Returns a token type value for the given script and BreakIterator rule status.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • EMOJI_SEQUENCE_STATUS

      public static final int EMOJI_SEQUENCE_STATUS
      Rule status for emoji sequences
  • Constructor Details

    • ICUTokenizerConfig

      public ICUTokenizerConfig()
      Sole constructor. (For invocation by subclass constructors, typically implicit.)
  • Method Details

    • getBreakIterator

      public abstract com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
      Returns a break iterator capable of processing the given script.
    • getType

      public abstract String getType(int script, int ruleStatus)
      Returns a token type value for the given script and BreakIterator rule status.
    • combineCJ

      public abstract boolean combineCJ()
      Returns true if Han, Hiragana, and Katakana scripts should all be reported as Japanese.
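
The interplay between combineCJ() and getType(int, int) can be sketched without the Lucene or ICU jars. Everything in the block below is an assumption made for illustration: the nested Config class is a stand-in that mirrors only the two documented methods needing no ICU types, the script codes and rule statuses are made-up integers (real code would presumably use ICU's script constants and the rule status values reported by the break iterator), the token type strings are placeholders, and effectiveScript is a hypothetical helper showing how a caller might honor the flag. It is not the library's implementation.

```java
final class TokenizerConfigSketch {

    // Hypothetical stand-in for ICUTokenizerConfig, limited to the two
    // documented methods that need no ICU types. Not the real Lucene class.
    abstract static class Config {
        abstract boolean combineCJ();
        abstract String getType(int script, int ruleStatus);
    }

    // Illustrative script codes only; these are NOT the real ICU values.
    static final int LATIN = 1, HAN = 2, HIRAGANA = 3, KATAKANA = 4, JAPANESE = 5;

    // Illustrative rule statuses only; real values come from the break rules.
    static final int STATUS_LETTER = 10, STATUS_IDEO = 20;

    // Hypothetical helper: reports Han, Hiragana, and Katakana as JAPANESE
    // when the config's combineCJ() returns true, else leaves the script as is.
    static int effectiveScript(Config config, int script) {
        boolean cj = script == HAN || script == HIRAGANA || script == KATAKANA;
        return (config.combineCJ() && cj) ? JAPANESE : script;
    }

    // Builds a config whose only variable behavior is the combineCJ flag.
    static Config newConfig(boolean combine) {
        return new Config() {
            @Override
            boolean combineCJ() {
                return combine;
            }

            @Override
            String getType(int script, int ruleStatus) {
                // Placeholder type strings chosen for this sketch.
                if (ruleStatus == STATUS_IDEO) {
                    return "<IDEOGRAPHIC>";
                }
                if (ruleStatus == STATUS_LETTER) {
                    return script == JAPANESE ? "<JAPANESE>" : "<ALPHANUM>";
                }
                return "<OTHER>";
            }
        };
    }

    public static void main(String[] args) {
        Config combined = newConfig(true);
        Config separate = newConfig(false);

        // The same Katakana input is reported under different scripts
        // depending on the flag, which in turn changes the token type.
        int a = effectiveScript(combined, KATAKANA);
        int b = effectiveScript(separate, KATAKANA);
        System.out.println(a + " " + combined.getType(a, STATUS_LETTER));
        System.out.println(b + " " + separate.getType(b, STATUS_LETTER));
    }
}
```

In a real subclass the same decisions would live in overrides of the three abstract methods listed above, with getBreakIterator(int script) supplying the per-script rules; extending DefaultICUTokenizerConfig and overriding only what differs is likely the shorter route.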