DefaultICUTokenizerConfig (Lucene 3.6.2 API)

java.lang.Object
- org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
- - org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

```
public class DefaultICUTokenizerConfig
extends ICUTokenizerConfig
```
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai text is broken into words with a DictionaryBasedBreakIterator.
- Lao, Myanmar, and Khmer text is broken into syllables based on custom BreakIterator rules.
- Hebrew text has custom tailorings to handle special cases involving punctuation.
WARNING: This API is experimental and might change in incompatible ways in the next release.
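
To get a feel for the default word-breaking behavior described above, here is a minimal sketch that uses the JDK's `java.text.BreakIterator` as a stand-in for ICU's `com.ibm.icu.text.BreakIterator`. That substitution is an assumption made only to keep the example dependency-free: the JDK rules are not identical to ICU's UAX#29 rules, and none of the tailorings listed above apply. The class `WordBreakSketch` and its `words` helper are hypothetical names, not part of Lucene.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Splits text at word boundaries and keeps only segments that contain
    // at least one letter or digit (dropping whitespace and punctuation).
    static List<String> words(String text) {
        // JDK stand-in for BreakIterator.getWordInstance(ULocale.ROOT) in ICU.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (hasLetterOrDigit(segment)) {
                out.add(segment);
            }
        }
        return out;
    }

    // Returns true if any code point in the segment is a letter or a digit.
    static boolean hasLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

In a real Lucene setup the segmentation would come from the tokenizer configured with this class rather than from a hand-written loop; the sketch only shows the boundary-walking idea.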

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`WORD_HANGUL` Token type for words containing Korean hangul
`static String`	`WORD_HIRAGANA` Token type for words containing Japanese hiragana
`static String`	`WORD_IDEO` Token type for words containing ideographic characters
`static String`	`WORD_KATAKANA` Token type for words containing Japanese katakana
`static String`	`WORD_LETTER` Token type for words that contain letters
`static String`	`WORD_NUMBER` Token type for words that appear to be numbers

Constructor Summary

Constructors
Constructor and Description
`DefaultICUTokenizerConfig()`

Method Summary

Methods
Modifier and Type	Method and Description
`com.ibm.icu.text.BreakIterator`	`getBreakIterator(int script)` Return a BreakIterator capable of processing a given script.
`String`	`getType(int script, int ruleStatus)` Return a token type value for a given script and BreakIterator rule status.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
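
The idea behind `getBreakIterator(int script)`, choosing a segmenter per script, can be sketched with JDK classes alone. Everything here is an assumption for illustration: the real method takes an `int` script code and returns an ICU `BreakIterator`, whereas this sketch uses the JDK's `Character.UnicodeScript` enum and `java.text.BreakIterator`. `ScriptDispatchSketch` and `breakIteratorFor` are hypothetical names, and whether the JDK applies dictionary-based rules for Thai depends on the runtime.

```java
import java.text.BreakIterator;
import java.util.Locale;

class ScriptDispatchSketch {
    // Hypothetical analogue of getBreakIterator(int script): pick a
    // script-specific iterator where one is wanted, else a root-locale default.
    static BreakIterator breakIteratorFor(Character.UnicodeScript script) {
        switch (script) {
            case THAI:
                // Locale-specific instance; stands in for a tailored iterator.
                return BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
            default:
                // Stands in for the untailored default rules.
                return BreakIterator.getWordInstance(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        String text = "ab cd";
        // Determine the script of the first code point, then dispatch on it.
        Character.UnicodeScript script = Character.UnicodeScript.of(text.codePointAt(0));
        BreakIterator it = breakIteratorFor(script);
        it.setText(text);
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            System.out.println("boundary at " + b);
        }
    }
}
```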

- Field Detail
  - WORD_IDEO
```
public static final String WORD_IDEO
```
    Token type for words containing ideographic characters
  - WORD_HIRAGANA
```
public static final String WORD_HIRAGANA
```
    Token type for words containing Japanese hiragana
  - WORD_KATAKANA
```
public static final String WORD_KATAKANA
```
    Token type for words containing Japanese katakana
  - WORD_HANGUL
```
public static final String WORD_HANGUL
```
    Token type for words containing Korean hangul
  - WORD_LETTER
```
public static final String WORD_LETTER
```
    Token type for words that contain letters
  - WORD_NUMBER
```
public static final String WORD_NUMBER
```
    Token type for words that appear to be numbers
- Constructor Detail
  - DefaultICUTokenizerConfig
```
public DefaultICUTokenizerConfig()
```
- Method Detail
  - getBreakIterator
```
public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
```
    Description copied from class: ICUTokenizerConfig
    
    Return a BreakIterator capable of processing a given script.
    
    Specified by:
    
    getBreakIterator in class ICUTokenizerConfig
  - getType
```
public String getType(int script,
             int ruleStatus)
```
    Description copied from class: ICUTokenizerConfig
    
    Return a token type value for a given script and BreakIterator rule status.
    
    Specified by:
    
    getType in class ICUTokenizerConfig
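
One plausible way a `getType(int script, int ruleStatus)` mapping could combine its two arguments is sketched below. This is not the library's implementation. The `Status` enum is a hypothetical stand-in for the BreakIterator rule status, `Character.UnicodeScript` replaces the `int` script code, and the string values of the constants are placeholders, since this page does not show the actual values.

```java
class TokenTypeSketch {
    // Hypothetical stand-in for the int rule status reported by a BreakIterator.
    enum Status { NUMBER, LETTER, KANA, IDEO, OTHER }

    // Placeholder values only; the real constants' string values are not shown here.
    static final String WORD_IDEO = "WORD_IDEO";
    static final String WORD_HIRAGANA = "WORD_HIRAGANA";
    static final String WORD_KATAKANA = "WORD_KATAKANA";
    static final String WORD_HANGUL = "WORD_HANGUL";
    static final String WORD_LETTER = "WORD_LETTER";
    static final String WORD_NUMBER = "WORD_NUMBER";

    // Assumed logic: the rule status gives the broad category, and the script
    // refines it where two token types share one category.
    static String typeFor(Character.UnicodeScript script, Status status) {
        switch (status) {
            case IDEO:
                return WORD_IDEO;
            case KANA:
                return script == Character.UnicodeScript.HIRAGANA ? WORD_HIRAGANA : WORD_KATAKANA;
            case LETTER:
                return script == Character.UnicodeScript.HANGUL ? WORD_HANGUL : WORD_LETTER;
            case NUMBER:
                return WORD_NUMBER;
            default:
                return "OTHER"; // placeholder for any unrecognized status
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Character.UnicodeScript.HANGUL, Status.LETTER));
        System.out.println(typeFor(Character.UnicodeScript.LATIN, Status.LETTER));
    }
}
```

Keeping the script check inside the status switch means a custom configuration would only need to override the cases where its token types differ.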
