org.apache.lucene.analysis.cjk (Lucene 9.1.0 common API)

package org.apache.lucene.analysis.cjk

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams. This analyzer generates bigram terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul characters.

Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

ChineseAnalyzer (in the analyzers/cn package): Index unigrams (individual Chinese characters) as tokens.
CJKAnalyzer (in this package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.

Example phrase: "我是中国人"

ChineseAnalyzer: 我－是－中－国－人
CJKAnalyzer: 我是－是中－中国－国人
SmartChineseAnalyzer: 我－是－中国－人
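
The unigram and bigram rows above can be reproduced with a few lines of plain Java. This is only an illustrative sketch of overlapping n-grams over code points. It is not Lucene's implementation and does not use the Lucene API; the class name CjkNgramSketch and the method ngrams are hypothetical. Word segmentation as done by SmartChineseAnalyzer needs a language model and is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class CjkNgramSketch {

    /** Returns all overlapping groups of n adjacent code points in text. */
    static List<String> ngrams(String text, int n) {
        // Work on code points so that supplementary characters stay intact.
        int[] codePoints = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= codePoints.length; i++) {
            out.add(new String(codePoints, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        // Unigrams: 我－是－中－国－人
        System.out.println(String.join("－", ngrams(phrase, 1)));
        // Bigrams: 我是－是中－中国－国人
        System.out.println(String.join("－", ngrams(phrase, 2)));
    }
}
```

Each interior character appears in two bigrams, so a query for "中国" matches the indexed term directly, at the cost of a larger index than unigrams would produce.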

Classes

Class: Description

CJKAnalyzer: An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
CJKBigramFilter: Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
CJKBigramFilterFactory: Factory for CJKBigramFilter.
CJKWidthCharFilter: A CharFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
CJKWidthCharFilterFactory: Factory for CJKWidthCharFilter.
CJKWidthFilter: A TokenFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
CJKWidthFilterFactory: Factory for CJKWidthFilter.
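
The two width foldings described for CJKWidthCharFilter and CJKWidthFilter can be illustrated in plain Java. The sketch below is an assumption-laden stand-in, not the Lucene code: the class WidthFoldSketch and both method names are hypothetical. Fullwidth ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from Basic Latin, so they can be folded arithmetically. For halfwidth Katakana the sketch falls back on the JDK's NFKC normalization, which covers that folding but also rewrites many other compatibility characters, so it is broader than the filters listed here.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Lucene.
class WidthFoldSketch {

    /** Folds fullwidth ASCII variants (U+FF01..U+FF5E) into Basic Latin. */
    static String foldFullwidthAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // Fullwidth forms are offset by 0xFEE0 from their ASCII equivalents.
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Broader stand-in: NFKC folds width variants, among other compatibility forms. */
    static String foldWidthNfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(foldFullwidthAscii("Ｌｕｃｅｎｅ９")); // prints Lucene9
        System.out.println(foldWidthNfkc("ｶﾀｶﾅ"));                 // prints カタカナ
    }
}
```

Folding widths before bigrams are formed means that text typed with fullwidth or halfwidth input methods produces the same terms as its standard-width equivalent.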
