java.lang.Object

org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig

org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig extends ICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.

WARNING: This API is experimental and might change in incompatible ways in the next release.
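The default UAX#29 word breaking mentioned above can be illustrated without any Lucene or ICU dependency. The following is a minimal sketch that uses the JDK's `java.text.BreakIterator` as a stand-in for ICU's `BreakIterator.getWordInstance(ULocale.ROOT)`. The class name `WordBreakSketch` and the `words` helper are hypothetical, the JDK rules are not guaranteed to match ICU's, and the dictionary-based tailorings for Thai, Lao, Myanmar, Khmer, and CJK are not reproduced here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Lucene.
public class WordBreakSketch {

    /** Returns the segments between word boundaries that start with a letter or digit. */
    public static List<String> words(String text) {
        // JDK stand-in for ICU's root-locale word break iterator (an assumption of this sketch).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; keep only word-like segments.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

A tokenizer built on a break iterator works the same way in outline: it walks consecutive boundaries and decides per segment whether to emit a token and which type to assign.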

Field Summary

Fields

Modifier and Type

Field

Description

static final String

WORD_EMOJI

Token type for words that appear to be emoji sequences

static final String

WORD_HANGUL

Token type for words containing Korean hangul

static final String

WORD_HIRAGANA

Token type for words containing Japanese hiragana

static final String

WORD_IDEO

Token type for words containing ideographic characters

static final String

WORD_KATAKANA

Token type for words containing Japanese katakana

static final String

WORD_LETTER

Token type for words that contain letters

static final String

WORD_NUMBER

Token type for words that appear to be numbers

Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS

Constructor Summary

Constructors

Constructor

Description

DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)

Creates a new config.

Method Summary

Modifier and Type

Method

Description

boolean

combineCJ()

Returns true if the Han, Hiragana, and Katakana scripts should all be returned as Japanese.

com.ibm.icu.text.RuleBasedBreakIterator

getBreakIterator(int script)

Returns a break iterator capable of processing the given script.

String

getType(int script, int ruleStatus)

Returns the token type value for the given script and BreakIterator rule status.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- WORD_IDEO
  
  public static final String WORD_IDEO
  
  Token type for words containing ideographic characters
- WORD_HIRAGANA
  
  public static final String WORD_HIRAGANA
  
  Token type for words containing Japanese hiragana
- WORD_KATAKANA
  
  public static final String WORD_KATAKANA
  
  Token type for words containing Japanese katakana
- WORD_HANGUL
  
  public static final String WORD_HANGUL
  
  Token type for words containing Korean hangul
- WORD_LETTER
  
  public static final String WORD_LETTER
  
  Token type for words that contain letters
- WORD_NUMBER
  
  public static final String WORD_NUMBER
  
  Token type for words that appear to be numbers
- WORD_EMOJI
  
  public static final String WORD_EMOJI
  
  Token type for words that appear to be emoji sequences

Constructor Details
- DefaultICUTokenizerConfig
  
  public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
  
  Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.
  
  Parameters:
  
  cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han, Hiragana, and Katakana words are tagged as IDEOGRAPHIC.
  
  myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation, otherwise it will be tokenized as syllables.
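The constructor note describes a common Java pattern: expensive shared state lives in static fields that the JVM initializes once, when the class is first used, while each instance only stores a couple of flags. The sketch below shows that pattern in plain Java. It is not Lucene's source; `LazyStaticInitSketch`, `Config`, `loadRules`, and the rule names are all hypothetical stand-ins.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of "lightweight instances, one-time class initialization".
public class LazyStaticInitSketch {

    /** Counts how many times the expensive shared setup has run. */
    static final AtomicInteger SETUP_RUNS = new AtomicInteger();

    /** Stand-in for a config whose heavy state is shared by all instances. */
    static final class Config {
        // Built once, when the JVM initializes Config (on first instantiation here).
        private static final String[] SHARED_RULES = loadRules();

        // Per-instance state is just two flags, so construction is cheap.
        private final boolean cjkAsWords;
        private final boolean myanmarAsWords;

        Config(boolean cjkAsWords, boolean myanmarAsWords) {
            this.cjkAsWords = cjkAsWords;
            this.myanmarAsWords = myanmarAsWords;
        }

        private static String[] loadRules() {
            SETUP_RUNS.incrementAndGet();
            // Placeholder for costly work such as parsing break rules.
            return new String[] {"default", "myanmar-syllable"};
        }

        int ruleCount() {
            return SHARED_RULES.length;
        }
    }

    /** Creates the given number of configs and reports how often the shared setup ran. */
    public static int setupRunsAfter(int instances) {
        for (int i = 0; i < instances; i++) {
            new Config(i % 2 == 0, true);
        }
        return SETUP_RUNS.get();
    }

    public static void main(String[] args) {
        System.out.println("setup runs: " + setupRunsAfter(3));
    }
}
```

Under this pattern, creating a config per tokenizer is inexpensive after the first use, and the one-time cost is paid when the class is first touched.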
Method Details
- combineCJ
  
  public boolean combineCJ()
  
  Description copied from class: ICUTokenizerConfig
  
  Returns true if the Han, Hiragana, and Katakana scripts should all be returned as Japanese.
  
  Specified by:
  
  combineCJ in class ICUTokenizerConfig
- getBreakIterator
  
  public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
  
  Description copied from class: ICUTokenizerConfig
  
  Returns a break iterator capable of processing the given script.
  
  Specified by:
  
  getBreakIterator in class ICUTokenizerConfig
- getType
  
  public String getType(int script, int ruleStatus)
  
  Description copied from class: ICUTokenizerConfig
  
  Returns the token type value for the given script and BreakIterator rule status.
  
  Specified by:
  
  getType in class ICUTokenizerConfig
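To make the idea of mapping a script and a rule status to a token type concrete, here is a simplified sketch in plain Java. It is an assumption-laden illustration, not the Lucene implementation: the real method takes ICU `int` script codes and rule status values, whereas this sketch substitutes the JDK's `Character.UnicodeScript` and a hypothetical `RuleStatus` enum, and it returns the field names from this page as placeholder labels rather than the actual constant values.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical illustration only; names and mapping are assumptions of this sketch.
public class TokenTypeSketch {

    /** Stand-in for the rule status reported by a word break iterator. */
    public enum RuleStatus { IDEO, KANA, LETTER, NUMBER, EMOJI, NONE }

    /** Chooses a token type label from the script of the token and the rule status. */
    public static String type(UnicodeScript script, RuleStatus status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Kana is split by script into hiragana and katakana.
                return script == UnicodeScript.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Hangul gets its own type; every other letter script shares one.
                return script == UnicodeScript.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            case EMOJI:
                return "WORD_EMOJI";
            default:
                return "OTHER";
        }
    }

    public static void main(String[] args) {
        // UnicodeScript.of maps a code point to its script.
        UnicodeScript script = UnicodeScript.of("한".codePointAt(0));
        System.out.println(type(script, RuleStatus.LETTER));
    }
}
```

The point of the two-argument signature is that the rule status alone cannot distinguish, for example, hiragana from katakana or hangul from other letters, so the script is needed as a second input.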
