Package org.apache.lucene.analysis.cn.smart

Analyzer for Simplified Chinese, which indexes words.

Class Summary
AnalyzerProfile: Manages analysis data configuration for SmartChineseAnalyzer.
CharType: Internal SmartChineseAnalyzer character type constants.
SentenceTokenizer: Tokenizes input text into sentences.
SmartChineseAnalyzer: An analyzer for Chinese or mixed Chinese-English text.
SmartChineseSentenceTokenizerFactory: Factory for the SmartChineseAnalyzer SentenceTokenizer.
SmartChineseWordTokenFilterFactory: Factory for the SmartChineseAnalyzer WordTokenFilter.
Utility: SmartChineseAnalyzer utility constants and methods.
WordTokenFilter: A TokenFilter that breaks sentences into words.
WordType: Internal SmartChineseAnalyzer token type constants.
 

Package org.apache.lucene.analysis.cn.smart Description

Analyzer for Simplified Chinese, which indexes words.

WARNING: This API is experimental and might change in incompatible ways in the next release.
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
  • StandardAnalyzer: Indexes unigrams (individual Chinese characters) as tokens.
  • CJKAnalyzer (in the analyzers/cjk package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
  • SmartChineseAnalyzer (in this package): Indexes words (attempts to segment Chinese text into words) as tokens.
Example phrase: "我是中国人"
  1. StandardAnalyzer: 我-是-中-国-人
  2. CJKAnalyzer: 我是-是中-中国-国人
  3. SmartChineseAnalyzer: 我-是-中国-人
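
The three token streams above can be reproduced with a small plain-Java sketch that does not depend on Lucene. The class name TokenizationSketch, its methods, and the one-entry dictionary are hypothetical and exist only for illustration. In particular, the greedy longest-match lookup in words() is a toy stand-in for word segmentation and is not the algorithm SmartChineseAnalyzer uses; the unigram and bigram methods only mimic the output shown for StandardAnalyzer and CJKAnalyzer on this phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
// Assumes the input contains only BMP characters (one char per character).
public class TokenizationSketch {

    // Unigrams: every character becomes its own token.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Bigrams: every overlapping pair of adjacent characters becomes a token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Toy word segmentation: greedy longest match against a dictionary,
    // falling back to a single character when nothing matches.
    // This is an assumption made for the sketch, not SmartChineseAnalyzer's method.
    static List<String> words(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        // Hypothetical one-entry dictionary, just enough for the example phrase.
        Set<String> dictionary = new HashSet<>(Arrays.asList("中国"));

        System.out.println(String.join("-", unigrams(phrase)));              // 我-是-中-国-人
        System.out.println(String.join("-", bigrams(phrase)));               // 我是-是中-中国-国人
        System.out.println(String.join("-", words(phrase, dictionary, 2)));  // 我-是-中国-人
    }
}
```

The sketch makes the trade-off visible: unigrams and bigrams need no linguistic data, while word-level tokens depend on some knowledge of which character sequences form words.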


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.