CJKBigramFilter (Lucene 4.7.2 API)

org.apache.lucene.analysis.cjk
Class CJKBigramFilter

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.TokenFilter
              org.apache.lucene.analysis.cjk.CJKBigramFilter

All Implemented Interfaces: Closeable

public final class CJKBigramFilter
extends TokenFilter

Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.

CJK types are set by these tokenizers, but you can also use CJKBigramFilter(TokenStream, int) to explicitly control which of the CJK scripts are turned into bigrams.

By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in CJKBigramFilter(TokenStream, int, boolean). This can be used for a combined unigram+bigram approach.

In all cases, all non-CJK input is passed through unmodified.
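The bigramming rules above can be sketched as a simplified, Lucene-free model. The class `BigramSketch` and its `bigrams` method are hypothetical names used only for illustration. The real filter works on a token stream with type, offset and position attributes, whereas this sketch takes a single run of adjacent CJK characters as a string. Emitting each unigram just before the bigram that starts at the same character is an assumption of this sketch, not something the description above states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the bigramming rule; not the Lucene implementation.
final class BigramSketch {

    // 'run' is one sequence of adjacent CJK characters.
    static List<String> bigrams(String run, boolean outputUnigrams) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            // A lone character has no neighbor, so it is always output as a unigram.
            if (outputUnigrams || cps.length == 1) {
                out.add(new String(cps, i, 1));
            }
            // Pair each character with the one that follows it, if any.
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "\u5927\u5B66\u751F"; // three Han ideographs
        System.out.println(bigrams(run, false)); // bigrams only
        System.out.println(bigrams(run, true));  // unigrams and bigrams
    }
}
```

With `outputUnigrams` set to false, a three-character run yields two overlapping bigrams; with it set to true, the three unigrams are interleaved with those bigrams.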

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary
`static String`	`DOUBLE_TYPE` when we emit a bigram, it is then marked as this type
`static int`	`HAN` bigram flag for Han Ideographs
`static int`	`HANGUL` bigram flag for Hangul
`static int`	`HIRAGANA` bigram flag for Hiragana
`static int`	`KATAKANA` bigram flag for Katakana
`static String`	`SINGLE_TYPE` when we emit a unigram, it is then marked as this type

Fields inherited from class org.apache.lucene.analysis.TokenFilter
`input`

Constructor Summary
`CJKBigramFilter(TokenStream in)` Calls `CJKBigramFilter(in, HAN | HIRAGANA | KATAKANA | HANGUL)`
`CJKBigramFilter(TokenStream in, int flags)` Calls `CJKBigramFilter(in, flags, false)`
`CJKBigramFilter(TokenStream in, int flags, boolean outputUnigrams)` Create a new CJKBigramFilter, specifying which writing systems should be bigrammed, and whether or not unigrams should also be output.
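The `flags` argument is a bitwise OR of the script constants, as the default `HAN | HIRAGANA | KATAKANA | HANGUL` shows. The following minimal sketch illustrates how such a mask is combined and tested. The class `FlagSketch`, the helper `isEnabled`, and the numeric values are assumptions chosen as distinct bits for illustration; the real values are listed under Constant Field Values.

```java
// Hypothetical stand-in for the flag constants; values are placeholders, not Lucene's.
final class FlagSketch {
    static final int HAN = 1;
    static final int HIRAGANA = 1 << 1;
    static final int KATAKANA = 1 << 2;
    static final int HANGUL = 1 << 3;

    // Mirrors the default used by the one-argument constructor: all four scripts.
    static final int ALL_SCRIPTS = HAN | HIRAGANA | KATAKANA | HANGUL;

    // True if 'script' is among the scripts selected by 'flags'.
    static boolean isEnabled(int flags, int script) {
        return (flags & script) != 0;
    }

    public static void main(String[] args) {
        int flags = HAN | HANGUL; // bigram only Han and Hangul
        System.out.println(isEnabled(flags, HAN));      // true
        System.out.println(isEnabled(flags, KATAKANA)); // false
    }
}
```

With the real class, the same kind of mask would be passed as the second constructor argument, for example `CJKBigramFilter.HAN | CJKBigramFilter.HANGUL`.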

Method Summary
`boolean`	`incrementToken()`
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.TokenFilter
`close, end`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Field Detail

HAN

public static final int HAN

bigram flag for Han Ideographs

See Also: Constant Field Values

HIRAGANA

public static final int HIRAGANA

bigram flag for Hiragana

See Also: Constant Field Values

KATAKANA

public static final int KATAKANA

bigram flag for Katakana

See Also: Constant Field Values

HANGUL

public static final int HANGUL

bigram flag for Hangul

See Also: Constant Field Values

DOUBLE_TYPE

public static final String DOUBLE_TYPE

when we emit a bigram, it is then marked as this type

See Also: Constant Field Values

SINGLE_TYPE

public static final String SINGLE_TYPE

when we emit a unigram, it is then marked as this type

See Also: Constant Field Values

Constructor Detail