org.apache.lucene.analysis.cjk.CJKBigramFilter

All Implemented Interfaces: Closeable, AutoCloseable, Unwrappable<TokenStream>

public final class CJKBigramFilter extends TokenFilter

Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.

CJK types are set by these tokenizers, but you can also use CJKBigramFilter(TokenStream, int) to explicitly control which of the CJK scripts are turned into bigrams.

By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in CJKBigramFilter(TokenStream, int, boolean). This can be used for a combined unigram+bigram approach.
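
The unigram fallback and the outputUnigrams flag can be illustrated with a small standalone sketch. It uses only the JDK and is not the Lucene implementation: the class name BigramSketch, the helper bigrams, and the order in which unigrams and bigrams are interleaved are assumptions made for illustration. It treats its input as a single run of adjacent CJK characters.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    /**
     * Splits one run of adjacent characters into overlapping bigrams.
     * Illustrative only: a simplified model, not the Lucene code.
     */
    static List<String> bigrams(String run, boolean outputUnigrams) {
        // Work on code points so that surrogate pairs stay intact.
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cps.length; i++) {
            // A lone character has no neighbour, so it is emitted as a unigram
            // even when outputUnigrams is false.
            if (outputUnigrams || cps.length == 1) {
                out.add(new String(cps, i, 1));
            }
            // Emit the bigram that starts at this position, if there is one.
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("学生試験", false)); // [学生, 生試, 試験]
        System.out.println(bigrams("学", false));       // [学]
        System.out.println(bigrams("学生試験", true));  // [学, 学生, 生, 生試, 試, 試験, 験]
    }
}
```

With outputUnigrams set to false the sketch emits unigrams only for isolated characters; with true it emits every unigram alongside the bigrams, which is the combined unigram+bigram approach described above.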

Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries. Korean Hangul characters are treated the same as many other scripts' letters, so StandardTokenizer can produce tokens that mix Hangul and non-Hangul characters, e.g. "한국abc". Such mixed-script tokens are typed as <ALPHANUM> rather than <HANGUL> and are therefore not converted to bigrams by CJKBigramFilter.

In all cases, all non-CJK input is passed through unmodified.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final String | DOUBLE_TYPE | Token type assigned to each emitted bigram |
| static final int | HAN | Bigram flag for Han ideographs |
| static final int | HANGUL | Bigram flag for Hangul |
| static final int | HIRAGANA | Bigram flag for Hiragana |
| static final int | KATAKANA | Bigram flag for Katakana |
| static final String | SINGLE_TYPE | Token type assigned to each emitted unigram |

Fields inherited from class org.apache.lucene.analysis.TokenFilter
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| CJKBigramFilter(TokenStream in) | Calls CJKBigramFilter(in, flags) with all four script flags (HAN, HIRAGANA, KATAKANA, HANGUL) OR'ed together |
| CJKBigramFilter(TokenStream in, int flags) | Calls CJKBigramFilter(in, flags, false) |
| CJKBigramFilter(TokenStream in, int flags, boolean outputUnigrams) | Creates a new CJKBigramFilter, specifying which writing systems should be bigrammed and whether unigrams should also be output |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| boolean | incrementToken() | |
| void | reset() | |

Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- HAN
  
  public static final int HAN
  
  bigram flag for Han Ideographs
  See Also:
  
  Constant Field Values
- HIRAGANA
  
  public static final int HIRAGANA
  
  bigram flag for Hiragana
  See Also:
  
  Constant Field Values
- KATAKANA
  
  public static final int KATAKANA
  
  bigram flag for Katakana
  See Also:
  
  Constant Field Values
- HANGUL
  
  public static final int HANGUL
  
  bigram flag for Hangul
  See Also:
  
  Constant Field Values
- DOUBLE_TYPE
  
  public static final String DOUBLE_TYPE
  
  Token type assigned to each emitted bigram.
  See Also:
  
  Constant Field Values
- SINGLE_TYPE
  
  public static final String SINGLE_TYPE
  
  Token type assigned to each emitted unigram.
  See Also:
  
  Constant Field Values

Constructor Details
- CJKBigramFilter
  
  public CJKBigramFilter(TokenStream in)
  
  Calls CJKBigramFilter(in, HAN | HIRAGANA | KATAKANA | HANGUL)
- CJKBigramFilter
  
  public CJKBigramFilter(TokenStream in, int flags)
  
  Calls CJKBigramFilter(in, flags, false)
- CJKBigramFilter
  
  public CJKBigramFilter(TokenStream in, int flags, boolean outputUnigrams)
  
  Create a new CJKBigramFilter, specifying which writing systems should be bigrammed, and whether or not unigrams should also be output.
  
  Parameters:
  
  flags - OR'ed set from HAN, HIRAGANA, KATAKANA, HANGUL
  
  outputUnigrams - true if unigrams for the selected writing systems should also be output. When this is false, a unigram is output only when a character has no adjacent characters to form a bigram.
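
As a rough illustration of what an OR'ed flags value means, the following JDK-only sketch tests whether a character's script is enabled in a bit mask. The FLAG_* constants and their bit values are hypothetical stand-ins defined locally; real code would pass the CJKBigramFilter constants, for example HAN | HIRAGANA, to the constructor. The sketch also classifies characters with Character.UnicodeScript, whereas the filter relies on the token types set by the tokenizer, so treat it as an analogy rather than the filter's logic.

```java
public class FlagSketch {
    // Hypothetical stand-in bit values, defined here only for illustration.
    // In real code, use the constants declared on CJKBigramFilter instead.
    static final int FLAG_HAN = 1;
    static final int FLAG_HIRAGANA = 2;
    static final int FLAG_KATAKANA = 4;
    static final int FLAG_HANGUL = 8;
    static final int ALL = FLAG_HAN | FLAG_HIRAGANA | FLAG_KATAKANA | FLAG_HANGUL;

    /** Returns true if the script of the given code point is enabled in flags. */
    static boolean selected(int codePoint, int flags) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
                return (flags & FLAG_HAN) != 0;
            case HIRAGANA:
                return (flags & FLAG_HIRAGANA) != 0;
            case KATAKANA:
                return (flags & FLAG_KATAKANA) != 0;
            case HANGUL:
                return (flags & FLAG_HANGUL) != 0;
            default:
                // Any other script is never selected for bigramming.
                return false;
        }
    }

    public static void main(String[] args) {
        int flags = FLAG_HAN | FLAG_HIRAGANA;       // select two scripts only
        System.out.println(selected('学', flags));  // true: Han is enabled
        System.out.println(selected('한', flags));  // false: Hangul is not enabled
        System.out.println(selected('a', ALL));     // false: not a CJK script
    }
}
```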

Method Details
- incrementToken
  
  public boolean incrementToken() throws IOException
  
  Specified by:
  
  incrementToken in class TokenStream
  
  Throws:
  
  IOException
- reset
  
  public void reset() throws IOException
  
  Overrides:
  
  reset in class TokenFilter
  
  Throws:
  
  IOException
