Class CommonGramsFilter
- java.lang.Object
-
- org.apache.lucene.util.AttributeSource
-
- org.apache.lucene.analysis.TokenStream
-
- org.apache.lucene.analysis.TokenFilter
-
- org.apache.lucene.analysis.commongrams.CommonGramsFilter
-
- All Implemented Interfaces:
Closeable
,AutoCloseable
,Unwrappable<TokenStream>
public final class CommonGramsFilter extends TokenFilter
Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This is achieved through the use ofPositionIncrementAttribute.setPositionIncrement(int)
. Bigrams have a type ofGRAM_TYPE
Example:- input:"the quick brown fox"
- output:|"the","the-quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0 so it is in the same position as "the" "the-quick" has a term.type() of "gram"
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
Fields Modifier and Type Field Description static String
GRAM_TYPE
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors Constructor Description CommonGramsFilter(TokenStream input, CharArraySet commonWords)
Construct a token stream filtering the given input using a Set of common words to create bigrams.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description boolean
incrementToken()
Inserts bigrams for common words into a token stream.void
reset()
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
-
-
-
Field Detail
-
GRAM_TYPE
public static final String GRAM_TYPE
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
CommonGramsFilter
public CommonGramsFilter(TokenStream input, CharArraySet commonWords)
Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs both unigrams with position increment and bigrams with position increment 0 type=gram where one or both of the words in a potential bigram are in the set of common words .- Parameters:
input
- TokenStream input in filter chaincommonWords
- The set of common words.
-
-
Method Detail
-
incrementToken
public boolean incrementToken() throws IOException
Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token are in the list of common words also output a bigram with position increment 0 and type="gram"TODO:Consider adding an option to not emit unigram stopwords as in CDL XTF BigramStopFilter, CommonGramsQueryFilter would need to be changed to work with this.
TODO: Consider optimizing for the case of three commongrams i.e "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year" but with proper management of positions we could eliminate the middle bigram "of-the"and save a disk seek and a whole set of position lookups.
- Specified by:
incrementToken
in classTokenStream
- Throws:
IOException
-
reset
public void reset() throws IOException
- Overrides:
reset
in classTokenFilter
- Throws:
IOException
-
-