org.apache.lucene.analysis.commongrams
Class CommonGramsFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.commongrams.CommonGramsFilter
- All Implemented Interfaces:
- Closeable
public final class CommonGramsFilter
- extends TokenFilter
Construct bigrams for frequently occurring terms while indexing. Single terms
are still indexed too, with bigrams overlaid. This is achieved through the
use of PositionIncrementAttribute.setPositionIncrement(int). Bigrams have a
type of GRAM_TYPE.
Example:
- input: "the quick brown fox"
- output: |"the","the-quick"|"quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0, so it is in the same position
as "the"
- "the-quick" has a term.type() of "gram"
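The overlay described above can be sketched without Lucene on the classpath. The following is a plain-Java simulation, not the filter's source: the class `CommonGramsSketch`, its `Tok` holder, and the `overlay` method are hypothetical names, the "-" separator simply follows the notation used on this page (the separator the real filter emits may differ), and the "word" type for unigrams is assumed to match Lucene's default token type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Standalone simulation of the bigram overlay; hypothetical, not Lucene code. */
class CommonGramsSketch {

    /** One emitted token: its text, position increment, and type. */
    static final class Tok {
        final String term;
        final int posInc;
        final String type;

        Tok(String term, int posInc, String type) {
            this.term = term;
            this.posInc = posInc;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + "," + type + ")";
        }
    }

    /**
     * Emits every word as a unigram (increment 1), and after it a bigram
     * (increment 0, type "gram") when the word or its successor is common.
     */
    static List<Tok> overlay(List<String> words, Set<String> common) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            out.add(new Tok(word, 1, "word")); // assumed default type
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (common.contains(word) || common.contains(next)) {
                    // "-" follows this page's notation; an assumption
                    out.add(new Tok(word + "-" + next, 0, "gram"));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the quick brown fox".split(" "));
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        for (Tok t : overlay(words, common)) {
            System.out.println(t);
        }
    }
}
```

In this sketch only "the-quick" carries an increment of 0, so it lands on the same position as "the"; every other token advances the position by one.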
Method Summary
boolean incrementToken()
- Inserts bigrams for common words into a token stream.
void reset()
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
GRAM_TYPE
public static final String GRAM_TYPE
- See Also:
- Constant Field Values
CommonGramsFilter
public CommonGramsFilter(Version matchVersion,
TokenStream input,
CharArraySet commonWords)
- Construct a token stream filtering the given input using a Set of common
words to create bigrams. Outputs both unigrams, with their position increment,
and bigrams, with position increment 0 and type=gram, where one or both of the
words in a potential bigram are in the set of common words.
- Parameters:
input
- TokenStream input in filter chain
commonWords
- The set of common words.
incrementToken
public boolean incrementToken()
throws IOException
- Inserts bigrams for common words into a token stream. For each input token,
output the token. If the token and/or the following token are in the list
of common words, also output a bigram with position increment 0 and
type="gram".
TODO: Consider adding an option to not emit unigram stopwords,
as in CDL XTF BigramStopFilter. CommonGramsQueryFilter would need to be
changed to work with this.
TODO: Consider optimizing for the case of three
commongrams, i.e. "man of the year" normally produces 3 bigrams: "man-of",
"of-the", "the-year", but with proper management of positions we could
eliminate the middle bigram "of-the" and save a disk seek and a whole set of
position lookups.
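To make the three-bigram case concrete, the sketch below turns position increments into absolute positions for "man of the year". It is a plain-Java illustration under stated assumptions, not the filter's implementation: `ThreeCommonGramsSketch` and `positions` are hypothetical names, both "of" and "the" are assumed to be in the common-word set, the "-" separator follows this page's notation, and positions are counted from 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of how overlaid bigrams share positions. */
class ThreeCommonGramsSketch {

    /** Returns "position:term" entries in emission order. */
    static List<String> positions(String[] words, Set<String> common) {
        List<String> out = new ArrayList<>();
        int position = -1; // advanced by each token's position increment
        for (int i = 0; i < words.length; i++) {
            position += 1; // unigram: increment of 1
            out.add(position + ":" + words[i]);
            boolean hasNext = i + 1 < words.length;
            if (hasNext && (common.contains(words[i]) || common.contains(words[i + 1]))) {
                // bigram: increment of 0, so it stays on the unigram's position
                out.add(position + ":" + words[i] + "-" + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("of", "the"));
        for (String entry : positions("man of the year".split(" "), common)) {
            System.out.println(entry);
        }
    }
}
```

Here the middle bigram "of-the" sits on position 1 between "man-of" (position 0) and "the-year" (position 2); it is the entry the second TODO proposes to eliminate.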
- Specified by:
incrementToken
in class TokenStream
- Throws:
IOException
reset
public void reset()
throws IOException
- Overrides:
reset
in class TokenFilter
- Throws:
IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.