Analyzer for Simplified Chinese, which indexes words.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
- Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
- StandardAnalyzer: Index unigrams (individual Chinese characters) as tokens.
- CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
- SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.
- Example phrase: 我是中国人
- StandardAnalyzer: 我－是－中－国－人
- CJKAnalyzer: 我是－是中－中国－国人
- SmartChineseAnalyzer: 我－是－中国－人
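The unigram and bigram outputs listed for the phrase 我是中国人 can be reproduced with a few lines of plain Java. This is only an illustrative sketch: it does not call Lucene, the class and method names (`TokenSketch`, `unigrams`, `bigrams`) are hypothetical, and it ignores details a real analyzer handles, such as mixed scripts or a single isolated character. Word segmentation is not sketched here, since it depends on dictionary data rather than a fixed rule.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; mimics the token shapes only, not Lucene's actual analyzers.
class TokenSketch {

    // One token per Unicode code point (the StandardAnalyzer-style output above).
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Overlapping pairs of adjacent characters (the CJKAnalyzer-style output above).
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        System.out.println(String.join("－", unigrams(phrase))); // 我－是－中－国－人
        System.out.println(String.join("－", bigrams(phrase)));  // 我是－是中－中国－国人
    }
}
```

A phrase of n characters yields n unigrams but only n - 1 bigrams, and every bigram overlaps its neighbour by one character. That is why bigram indexing can match 中国 directly but also produces tokens such as 是中 that are not words.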
Classes:
- Manages analysis data configuration for SmartChineseAnalyzer.
- Internal SmartChineseAnalyzer character type constants.
- Tokenizer for Chinese or mixed Chinese-English text.
- Factory for HMMChineseTokenizer.
- SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.
- SmartChineseAnalyzer utility constants and methods.
- Internal SmartChineseAnalyzer token type constants.