Class CJKBigramFilter
- All Implemented Interfaces: Closeable, AutoCloseable, Unwrappable<TokenStream>
Forms bigrams of CJK terms generated by StandardTokenizer or ICUTokenizer.
These tokenizers set the CJK token types, but you can also use CJKBigramFilter(TokenStream, int) to explicitly control which of the CJK scripts are turned into bigrams.
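The flags argument is an OR'ed combination of the HAN, HIRAGANA, KATAKANA and HANGUL fields. The following is a minimal, library-free sketch of that bitmask idea. The class `BigramFlagsSketch`, the helper `selected`, and the numeric values of the constants are stand-ins chosen for illustration, not taken from Lucene; in real code you would pass something like `CJKBigramFilter.HAN | CJKBigramFilter.HANGUL` to the constructor.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramFlagsSketch {
    // Stand-in constants named after the CJKBigramFilter fields.
    // The numeric values are assumptions for illustration; all that
    // matters here is that each one occupies a distinct bit.
    static final int HAN = 1;
    static final int HIRAGANA = 2;
    static final int KATAKANA = 4;
    static final int HANGUL = 8;

    /** Lists the scripts selected by a combined flags value (hypothetical helper). */
    static List<String> selected(int flags) {
        List<String> out = new ArrayList<>();
        if ((flags & HAN) != 0) out.add("HAN");
        if ((flags & HIRAGANA) != 0) out.add("HIRAGANA");
        if ((flags & KATAKANA) != 0) out.add("KATAKANA");
        if ((flags & HANGUL) != 0) out.add("HANGUL");
        return out;
    }

    public static void main(String[] args) {
        // Ask for bigrams of Han and Hangul only; kana would pass through unchanged.
        int flags = HAN | HANGUL;
        System.out.println(selected(flags));
    }
}
```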
By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in CJKBigramFilter(TokenStream, int, boolean).
This can be used for a combined unigram+bigram approach.
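The unigram and bigram behaviour described above can be modelled without the library. This is a simplified sketch, not Lucene's implementation: it handles a single unbroken run of CJK characters, works on code points, and ignores offsets, position increments and token types. The class `BigramModelSketch` and the method `tokens` are hypothetical names, and the unigram-before-bigram ordering is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramModelSketch {
    /**
     * Models the terms produced for one unbroken run of CJK characters.
     * Overlapping bigrams are always produced; unigrams are produced when
     * outputUnigrams is set, or when the run is a single character with
     * no neighbour to pair with.
     */
    static List<String> tokens(String run, boolean outputUnigrams) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            if (outputUnigrams || cps.length == 1) {
                out.add(new String(cps, i, 1)); // unigram
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2)); // bigram with the next character
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("多くの学生", false)); // bigrams only
        System.out.println(tokens("多くの学生", true));  // unigrams and bigrams
        System.out.println(tokens("学", false));          // lone character
    }
}
```

In this model, a two-character run yields one bigram by default and three terms (two unigrams plus the bigram) with outputUnigrams set, which is the combined unigram+bigram output.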
Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries. Korean Hangul characters are treated the same as many other scripts' letters, and as a result, StandardTokenizer can produce tokens that mix Hangul and non-Hangul characters, e.g. "한국abc". Such mixed-script tokens are typed as <ALPHANUM> rather than <HANGUL>, and as a result, will not be converted to bigrams by CJKBigramFilter.
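To see why a token such as "한국abc" does not count as a Hangul token, it can help to list the Unicode scripts it contains. The sketch below uses only the JDK's `Character.UnicodeScript`; it is not how StandardTokenizer assigns token types, and `MixedScriptSketch`, `scripts` and `isPureHangul` are hypothetical names introduced for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class MixedScriptSketch {
    /** Collects the Unicode scripts of all code points in the token. */
    static Set<Character.UnicodeScript> scripts(String token) {
        Set<Character.UnicodeScript> found = EnumSet.noneOf(Character.UnicodeScript.class);
        token.codePoints().forEach(cp -> found.add(Character.UnicodeScript.of(cp)));
        return found;
    }

    /** True only when every code point in a non-empty token is Hangul. */
    static boolean isPureHangul(String token) {
        return scripts(token).equals(EnumSet.of(Character.UnicodeScript.HANGUL));
    }

    public static void main(String[] args) {
        System.out.println(scripts("한국abc"));      // contains both LATIN and HANGUL
        System.out.println(isPureHangul("한국abc")); // false
        System.out.println(isPureHangul("한국"));    // true
    }
}
```

A check like this could be used to spot input where mixed-script tokens are likely, in which case ICUTokenizer, which does split at script boundaries, may be the better fit.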
In all cases, all non-CJK input is passed through unmodified.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Modifier and Type | Field | Description
static final String | DOUBLE_TYPE | when we emit a bigram, it's then marked as this type
static final int | HAN | bigram flag for Han Ideographs
static final int | HANGUL | bigram flag for Hangul
static final int | HIRAGANA | bigram flag for Hiragana
static final int | KATAKANA | bigram flag for Katakana
static final String | SINGLE_TYPE | when we emit a unigram, it's then marked as this type
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor | Description
CJKBigramFilter(TokenStream in, int flags) |
CJKBigramFilter(TokenStream in, int flags, boolean outputUnigrams) | Create a new CJKBigramFilter, specifying which writing systems should be bigrammed, and whether or not unigrams should also be output.
-
Method Summary
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
HAN
public static final int HAN
bigram flag for Han Ideographs
-
HIRAGANA
public static final int HIRAGANA
bigram flag for Hiragana
-
KATAKANA
public static final int KATAKANA
bigram flag for Katakana
-
HANGUL
public static final int HANGUL
bigram flag for Hangul
-
DOUBLE_TYPE
public static final String DOUBLE_TYPE
when we emit a bigram, it's then marked as this type
-
SINGLE_TYPE
public static final String SINGLE_TYPE
when we emit a unigram, it's then marked as this type
-
-
Constructor Details
-
CJKBigramFilter
-
CJKBigramFilter
-
CJKBigramFilter
Create a new CJKBigramFilter, specifying which writing systems should be bigrammed, and whether or not unigrams should also be output.
-
-
Method Details
-
incrementToken
- Specified by: incrementToken in class TokenStream
- Throws:
IOException
-
reset
- Overrides: reset in class TokenFilter
- Throws:
IOException
-