CommonGramsFilter (Lucene 4.2.1 API)

org.apache.lucene.analysis.commongrams
Class CommonGramsFilter

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.TokenFilter
              org.apache.lucene.analysis.commongrams.CommonGramsFilter

All Implemented Interfaces: Closeable

public final class CommonGramsFilter
extends TokenFilter

Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid on them. The overlay is achieved through PositionIncrementAttribute.setPositionIncrement(int). Bigrams have a type of GRAM_TYPE. Example:

input:"the quick brown fox"
output:|"the","the-quick"|"quick"|"brown"|"fox"|
"the-quick" has a position increment of 0, so it is in the same position as "the".
"the-quick" has a term.type() of "gram".
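
To make the overlay concrete, the following is a minimal plain-Java sketch, not Lucene code, that resolves a sequence of position increments into absolute positions. The class PositionOverlaySketch and its method absolutePositions are hypothetical names introduced for illustration, and the token list assumes the unigram "quick" is emitted as well, as the incrementToken description on this page states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: turns position increments into
// absolute positions to show how a bigram shares a position with its unigram.
final class PositionOverlaySketch {

    /** Returns the absolute position of each token given its position increment. */
    static List<Integer> absolutePositions(int[] positionIncrements) {
        List<Integer> positions = new ArrayList<>();
        int position = -1; // a first increment of 1 then lands on position 0
        for (int increment : positionIncrements) {
            position += increment; // an increment of 0 keeps the previous position
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Assumed token sequence for the input "the quick brown fox".
        String[] terms = {"the", "the-quick", "quick", "brown", "fox"};
        int[] increments = {1, 0, 1, 1, 1};
        List<Integer> positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(positions.get(i) + "\t" + terms[i]);
        }
    }
}
```

Because "the-quick" carries an increment of 0, it resolves to the same absolute position as "the", which is what lets a phrase query match either the unigram or the bigram at that position.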

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary
`static String`	`GRAM_TYPE`

Fields inherited from class org.apache.lucene.analysis.TokenFilter
`input`

Constructor Summary
`CommonGramsFilter(Version matchVersion, TokenStream input, CharArraySet commonWords)` Construct a token stream filtering the given input using a Set of common words to create bigrams.

Method Summary
`boolean`	`incrementToken()` Inserts bigrams for common words into a token stream.
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.TokenFilter
`close, end`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait`

Field Detail

GRAM_TYPE

public static final String GRAM_TYPE

See Also: Constant Field Values

Constructor Detail

CommonGramsFilter

public CommonGramsFilter(Version matchVersion,
                         TokenStream input,
                         CharArraySet commonWords)

Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs both unigrams, with their own position increment, and bigrams, with position increment 0 and type=gram, wherever one or both of the words in a potential bigram are in the set of common words.

Parameters:
input - TokenStream input in filter chain
commonWords - The set of common words.

Method Detail

incrementToken

public boolean incrementToken()
                       throws IOException

Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token are in the list of common words, also output a bigram with position increment 0 and type="gram".

TODO: Consider adding an option to not emit unigram stopwords, as in CDL XTF BigramStopFilter; CommonGramsQueryFilter would need to be changed to work with this.

TODO: Consider optimizing for the case of three commongrams, i.e. "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year", but with proper management of positions we could eliminate the middle bigram "of-the" and save a disk seek and a whole set of position lookups.
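
The rule described above can be sketched in plain Java over an in-memory word list. This is an illustration under stated assumptions, not the Lucene implementation: CommonGramsSketch, its Token class, the "word" type label, and the "-" separator (chosen to match the notation used on this page) are all hypothetical. The real filter works incrementally on a TokenStream rather than on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the bigram-insertion rule; not Lucene code.
final class CommonGramsSketch {
    static final String GRAM_TYPE = "gram";
    static final String WORD_TYPE = "word"; // assumed label for unigrams
    static final String SEPARATOR = "-";    // assumed, matches this page's examples

    /** Minimal stand-in for a token: term text, position increment, type. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final String type;

        Token(String term, int positionIncrement, String type) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.type = type;
        }
    }

    /** Emits every unigram, plus a bigram when either neighbour is a common word. */
    static List<Token> commonGrams(List<String> words, Set<String> commonWords) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String current = words.get(i);
            out.add(new Token(current, 1, WORD_TYPE));
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    // Increment 0 overlays the bigram on the unigram just emitted.
                    out.add(new Token(current + SEPARATOR + next, 0, GRAM_TYPE));
                }
            }
        }
        return out;
    }

    /** Extracts just the term text, in output order. */
    static List<String> terms(List<Token> tokens) {
        List<String> result = new ArrayList<>();
        for (Token token : tokens) {
            result.add(token.term);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("of", "the"));
        List<Token> tokens = commonGrams(Arrays.asList("man", "of", "the", "year"), common);
        for (Token t : tokens) {
            System.out.println(t.term + "\tposInc=" + t.positionIncrement + "\ttype=" + t.type);
        }
    }
}
```

For "man of the year" with "of" and "the" as common words, the sketch emits all three bigrams named in the second TODO, including the middle "of-the" that the proposed optimization would drop.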

Specified by: incrementToken in class TokenStream

Throws: IOException

reset

public void reset()
           throws IOException

Overrides: reset in class TokenFilter

Throws: IOException