Package org.apache.lucene.analysis.cjk

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams.


Class Summary
CJKAnalyzer An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK terms with CJKBigramFilter, and filters stopwords with StopFilter.
CJKBigramFilter Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
CJKBigramFilterFactory Factory for CJKBigramFilter.
CJKTokenizer Deprecated. Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.
CJKTokenizerFactory Deprecated. Use CJKBigramFilterFactory instead.
CJKWidthFilter A TokenFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
CJKWidthFilterFactory Factory for CJKWidthFilter.
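
The fullwidth ASCII folding described for CJKWidthFilter can be illustrated with a minimal plain-Java sketch. It does not use the Lucene API and is not Lucene's implementation; the class and method names (FullwidthFoldSketch, foldFullwidthAscii) are hypothetical. It relies only on the Unicode layout, in which the fullwidth forms U+FF01..U+FF5E sit exactly 0xFEE0 above their Basic Latin counterparts U+0021..U+007E. Halfwidth Katakana folding is more involved and is deliberately left out.

```java
public class FullwidthFoldSketch {

    // Hypothetical helper: maps fullwidth ASCII variants (U+FF01..U+FF5E)
    // to Basic Latin (U+0021..U+007E). All other characters pass through.
    static String foldFullwidthAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xFF01 && c <= 0xFF5E) {
                sb.append((char) (c - 0xFEE0)); // fixed offset between the two blocks
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fullwidth "ＬＵＣＥＮＥ" followed by ordinary text.
        System.out.println(foldFullwidthAscii("\uFF2C\uFF35\uFF23\uFF25\uFF2E\uFF25 search"));
    }
}
```

In an actual analysis chain this step would run before lowercasing and bigram formation, as in the CJKAnalyzer description above.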

Package org.apache.lucene.analysis.cjk Description

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams. This analyzer generates bigram terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul characters.
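
To make "overlapping groups of two adjacent characters" concrete, here is a minimal plain-Java sketch of bigram formation. It is an illustration only, not the CJKBigramFilter implementation, and it does not call any Lucene class; the names CjkBigramSketch, isCjk, and bigrams are hypothetical. As simplifying assumptions, the sketch drops non-CJK characters entirely, treats any mix of the four scripts as one run, and emits an isolated CJK character as a single-character term.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {

    // True for code points in the four scripts named in the description.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Hypothetical helper: returns overlapping bigrams for each run of CJK characters.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        int[] cps = text.codePoints().toArray(); // code points, so supplementary Han is handled
        int i = 0;
        while (i < cps.length) {
            if (!isCjk(cps[i])) {
                i++; // assumption of this sketch: non-CJK input is skipped
                continue;
            }
            int start = i;
            while (i < cps.length && isCjk(cps[i])) {
                i++;
            }
            if (i - start == 1) {
                out.add(new String(cps, start, 1)); // lone character: no neighbor to pair with
            } else {
                for (int j = start; j + 1 < i; j++) {
                    out.add(new String(cps, j, 2)); // each bigram shares a character with the next
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints 我是-是中-中国-国人
        System.out.println(String.join("-", bigrams("我是中国人")));
    }
}
```

A five-character run yields four bigrams, so a query for any two adjacent characters can match without the index needing to know where word boundaries fall.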

Three analyzers are provided for Chinese, each of which tokenizes Chinese text differently: by single character, by overlapping bigram, or by segmented word.

Example phrase: "我是中国人"
  1. ChineseAnalyzer: 我-是-中-国-人
  2. CJKAnalyzer: 我是-是中-中国-国人
  3. SmartChineseAnalyzer: 我-是-中国-人

Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.