Analyzer for Simplified Chinese, which indexes words.
Class Summary
AnalyzerProfile | Manages analysis data configuration for SmartChineseAnalyzer.
CharType | Internal SmartChineseAnalyzer character type constants.
SentenceTokenizer | Tokenizes input text into sentences.
SmartChineseAnalyzer | An analyzer for Chinese or mixed Chinese-English text.
Utility | SmartChineseAnalyzer utility constants and methods.
WordTokenFilter | A TokenFilter that breaks sentences into words.
WordType | Internal SmartChineseAnalyzer token type constants.
Package org.apache.lucene.analysis.cn.smart Description
WARNING: The analysis.cn.smart package (in analyzers/smartcn) is experimental. The APIs
and file formats introduced here may change in future releases, and the current versions
will not be supported once they do.
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
- ChineseAnalyzer (in the analyzers/cn package): Indexes unigrams (individual Chinese characters) as tokens.
- CJKAnalyzer (in the analyzers/cjk package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
- SmartChineseAnalyzer (in this package): Indexes words (it attempts to segment Chinese text into words) as tokens.
Example phrase: "我是中国人"
- ChineseAnalyzer: 我-是-中-国-人
- CJKAnalyzer: 我是-是中-中国-国人
- SmartChineseAnalyzer: 我-是-中国-人
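The three token streams above can be reproduced with a small stand-alone sketch. It uses only the JDK and does not call Lucene. The class name ChineseTokenSketch, its methods, and the one-word dictionary are hypothetical and exist only for illustration. In particular, the words method uses a simple greedy longest-match lookup as a stand-in for word segmentation. It is not the algorithm SmartChineseAnalyzer uses, and it assumes every character fits in a single Java char.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustration only: mimics the token shapes in the example, not Lucene's implementation.
public class ChineseTokenSketch {

    // One token per character (the ChineseAnalyzer row).
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping pairs of adjacent characters (the CJKAnalyzer row).
    // Returns an empty list for input shorter than two characters.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Toy word segmentation: at each position take the longest dictionary
    // entry (up to maxWordLength), falling back to a single character.
    // ASSUMPTION: a stand-in technique, not SmartChineseAnalyzer's algorithm.
    public static List<String> words(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        // Hypothetical one-entry dictionary, just enough for this phrase.
        Set<String> dictionary = Collections.singleton("中国");

        System.out.println(String.join("-", unigrams(phrase)));             // 我-是-中-国-人
        System.out.println(String.join("-", bigrams(phrase)));              // 我是-是中-中国-国人
        System.out.println(String.join("-", words(phrase, dictionary, 2))); // 我-是-中国-人
    }
}
```

The trade-off the sketch makes visible: unigrams and bigrams need no language data, while word-level tokens depend on a dictionary or model, which is why this package ships analysis data managed by AnalyzerProfile.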
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.