CJKAnalyzer: An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
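The chain order described above (tokenize, fold width, lowercase, form bigrams) can be illustrated with a self-contained sketch. This is not Lucene's implementation: the class and method names below are hypothetical, width folding is limited to fullwidth ASCII, and stopword filtering is omitted. Like CJKBigramFilter's default behavior, runs of Han characters become overlapping bigrams (a lone Han character stays a unigram) while Latin terms pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CjkPipelineSketch {
    // Fold fullwidth ASCII (U+FF01..U+FF5E) into Basic Latin; fullwidth space too.
    static String foldWidth(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= '\uFF01' && c <= '\uFF5E') out.append((char) (c - 0xFEE0));
            else if (c == '\u3000') out.append(' ');
            else out.append(c);
        }
        return out.toString();
    }

    // Width-fold, lowercase, then emit bigrams for Han runs and whole terms otherwise.
    static List<String> analyze(String text) {
        String folded = foldWidth(text).toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        StringBuilder han = new StringBuilder();
        StringBuilder latin = new StringBuilder();
        for (int i = 0; i < folded.length(); ) {
            int cp = folded.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flushLatin(latin, tokens);
                han.appendCodePoint(cp);
            } else {
                flushHan(han, tokens);
                if (Character.isLetterOrDigit(cp)) latin.appendCodePoint(cp);
                else flushLatin(latin, tokens);
            }
            i += Character.charCount(cp);
        }
        flushHan(han, tokens);
        flushLatin(latin, tokens);
        return tokens;
    }

    static void flushHan(StringBuilder han, List<String> tokens) {
        for (int i = 0; i + 1 < han.length(); i++) tokens.add(han.substring(i, i + 2));
        if (han.length() == 1) tokens.add(han.toString()); // lone Han char stays a unigram
        han.setLength(0);
    }

    static void flushLatin(StringBuilder latin, List<String> tokens) {
        if (latin.length() > 0) tokens.add(latin.toString());
        latin.setLength(0);
    }

    public static void main(String[] args) {
        // fullwidth "Ｌｕｃｅｎｅ" is folded and lowercased; Han run becomes bigrams
        System.out.println(analyze("Ｌｕｃｅｎｅ 我是中国人"));
        // prints [lucene, 我是, 是中, 中国, 国人]
    }
}
```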
CJKBigramFilter: Forms bigrams of CJK terms that are generated from StandardTokenizer.
CJKTokenizer (deprecated): Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.
CJKWidthFilter: TokenFilter that normalizes CJK width differences:
- Folds fullwidth ASCII variants into the equivalent Basic Latin
- Folds halfwidth Katakana variants into the equivalent Kana
NOTE: this filter can be viewed as a (practical) subset of NFKC/NFKD Unicode normalization.
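Because these foldings are a subset of NFKC, the same mappings can be observed with the JDK's own java.text.Normalizer. The sketch below is for illustration, not the filter's implementation; note that full NFKC also applies other compatibility mappings (e.g. circled digits) that the filter does not.

```java
import java.text.Normalizer;

public class WidthFoldDemo {
    // NFKC applies compatibility decomposition then canonical composition,
    // which covers both foldings the filter performs.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(fold("Ｌｕｃｅｎｅ")); // fullwidth ASCII -> Lucene
        System.out.println(fold("ｶﾀｶﾅ"));       // halfwidth Katakana -> カタカナ
    }
}
```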
Package org.apache.lucene.analysis.cjk Description
Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters).
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
- ChineseAnalyzer (in the analyzers/cn package): Indexes unigrams (individual Chinese characters) as tokens.
- CJKAnalyzer (in this package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
- SmartChineseAnalyzer (in the analyzers/smartcn package): Indexes words (attempts to segment Chinese text into words) as tokens.
Example phrase: "我是中国人"
- ChineseAnalyzer: 我－是－中－国－人
- CJKAnalyzer: 我是－是中－中国－国人
- SmartChineseAnalyzer: 我－是－中国－人
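The unigram and bigram strategies are purely mechanical and can be sketched in a few lines of plain Java (hypothetical helper names, not the analyzers' actual code). Word segmentation as done by SmartChineseAnalyzer requires a dictionary or statistical model, so it is not shown.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentationDemo {
    // ChineseAnalyzer-style: each character becomes its own token.
    static List<String> unigrams(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) out.add(String.valueOf(c));
        return out;
    }

    // CJKAnalyzer-style: overlapping pairs of adjacent characters.
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) out.add(s.substring(i, i + 2));
        return out;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        System.out.println(unigrams(phrase)); // [我, 是, 中, 国, 人]
        System.out.println(bigrams(phrase));  // [我是, 是中, 中国, 国人]
    }
}
```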