org.apache.lucene.analysis.tokenattributes. It also uses hardcoded payload encoders, which makes it not easily adaptable to other use cases.
@Deprecated public final class ShingleMatrixFilter extends TokenStream
A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.
For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
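The example above can be reproduced with a few lines of plain Java. This is only an illustrative sketch that does not use the Lucene API; the class name `WordShingleSketch` and the method `shingles` are hypothetical, and the input is assumed to be already split into tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordShingleSketch {

    /** Returns every run of exactly {@code size} adjacent tokens, joined with {@code spacer}. */
    static List<String> shingles(List<String> tokens, int size, String spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            result.add(String.join(spacer, tokens.subList(start, start + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please divide this sentence into shingles".split(" "));
        // Prints the five two-token shingles listed in the text, one per line.
        for (String shingle : shingles(tokens, 2, " ")) {
            System.out.println(shingle);
        }
    }
}
```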
Using a shingle filter at index and query time can in some instances replace phrase queries, especially those with 0 slop.
Without a spacer character it can be used to handle composition and decomposition of words, such as searching for "multi dimensional" instead of "multidimensional". This is a rather common human problem at query time in several languages, notably those of the North Germanic branch.
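As a rough illustration of the spacer-less case, the sketch below joins adjacent query tokens without any separator so that "multi dimensional" yields the term "multidimensional". It is plain Java rather than the filter itself, and `CompoundShingleSketch` and `spacerlessBigrams` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompoundShingleSketch {

    /** Joins each pair of adjacent tokens with no spacer, e.g. [multi, dimensional] gives multidimensional. */
    static Set<String> spacerlessBigrams(List<String> tokens) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + tokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> queryTerms = spacerlessBigrams(Arrays.asList("multi", "dimensional", "arrays"));
        // The composed form is now among the query terms and can match an indexed "multidimensional".
        System.out.println(queryTerms.contains("multidimensional"));
    }
}
```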
Among other things, shingles are also known to help solve problems in spell checking, language detection and document clustering.
This filter is backed by a three-dimensional, column-oriented matrix. It creates permutations of the second dimension, the rows, and leaves the third, the z-axis, for multi-token synonyms.
In order to use this filter you need to define a way of positioning the input stream tokens in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec.
There are three simple implementations for demonstration purposes, see ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.
Consider this token matrix:
Token[column][row][z-axis] {
{{hello}, {greetings, and, salutations}},
{{world}, {earth}, {tellus}}
};
It would produce the following 2-3 gram sized shingles:
"hello_world" "greetings_and" "greetings_and_salutations" "and_salutations" "and_salutations_world" "salutations_world" "hello_earth" "and_salutations_earth" "salutations_earth" "hello_tellus" "and_salutations_tellus" "salutations_tellus"
This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is a great number and the stream contains many columns, or if each column contains a great number of rows.
The problem is that, in order to avoid producing duplicates, the filter needs to keep track of every shingle already produced and returned to the consumer. There is a bit of resource management to handle this, but it would of course be much better if the filter were written so that it never created the same shingle more than once in the first place.
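The matrix example and the duplicate bookkeeping described above can be sketched in plain Java as follows. This is not the filter's actual implementation: it ignores the z-axis and weights, assumes every column has at least one row, and uses hypothetical names (`MatrixShingleSketch`, `shingles`). It picks one row per column, flattens the chosen rows into a token sequence, emits all n-grams of the allowed sizes, and remembers what it has already produced in a set, which is where the memory cost comes from.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MatrixShingleSketch {

    /** matrix.get(column).get(row) holds the tokens of one row; the z-axis is left out of this sketch. */
    static Set<String> shingles(List<List<List<String>>> matrix, int minSize, int maxSize, String spacer) {
        Set<String> seen = new LinkedHashSet<>(); // every shingle produced so far, in first-seen order
        int[] rowIndex = new int[matrix.size()];  // the row currently selected in each column
        while (true) {
            // Flatten the current permutation of rows into one token sequence.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.size(); c++) {
                tokens.addAll(matrix.get(c).get(rowIndex[c]));
            }
            // Emit every n-gram of an allowed size; the set silently drops duplicates.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                    seen.add(String.join(spacer, tokens.subList(start, start + size)));
                }
            }
            // Advance to the next row permutation, with the first column changing fastest.
            int c = 0;
            while (c < rowIndex.length && ++rowIndex[c] == matrix.get(c).size()) {
                rowIndex[c] = 0;
                c++;
            }
            if (c == rowIndex.length) {
                break; // every column wrapped around: all permutations visited
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<List<List<String>>> matrix = List.of(
                List.of(List.of("hello"), List.of("greetings", "and", "salutations")),
                List.of(List.of("world"), List.of("earth"), List.of("tellus")));
        shingles(matrix, 2, 3, "_").forEach(System.out::println);
    }
}
```

For the example matrix this yields the twelve shingles listed above. The number of row permutations is the product of the row counts of all columns, so both the work and the size of the `seen` set grow quickly with wide or tall matrices.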
The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See calculateShingleWeight(org.apache.lucene.analysis.Token, java.util.List, int, java.util.List, java.util.List).
Modifier and Type | Class and Description |
---|---|
static class | ShingleMatrixFilter.Matrix: Deprecated. A column focused matrix in three dimensions: Token[column][row][z-axis] { {{hello}, {greetings, and, salutations}}, {{world}, {earth}, {tellus}} }; todo: consider row groups to indicate that shingles are only to contain permutations with texts in that same row group. |
static class | ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec: Deprecated. Using this codec makes a ShingleMatrixFilter act like ShingleFilter. |
static class | ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec: Deprecated. A full featured codec not to be used for something serious. |
static class | ShingleMatrixFilter.TokenPositioner: Deprecated. Used to describe how a Token is to be inserted into a ShingleMatrixFilter.Matrix. |
static class | ShingleMatrixFilter.TokenSettingsCodec: Deprecated. Strategy used to code and decode metadata of the tokens from the input stream regarding how to position the tokens in the matrix, set and retrieve weight, etc. |
static class | ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec: Deprecated. A codec that creates a two dimensional matrix by treating tokens from the input stream with 0 position increment as new rows to the current column. |
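The positioning rule described for ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec (a position increment of 0 adds a new row to the current column) can be illustrated with a small stand-alone sketch. It does not use the codec or any Lucene class; `SynonymRowsSketch`, `Tok` and `toColumns` are hypothetical names, and each row is assumed to hold a single token.

```java
import java.util.ArrayList;
import java.util.List;

class SynonymRowsSketch {

    /** Hypothetical stand-in for an input token: its text and its position increment. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Groups tokens into columns; a token with position increment 0 becomes a new row in the current column. */
    static List<List<String>> toColumns(List<Tok> tokens) {
        List<List<String>> columns = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.positionIncrement == 0 && !columns.isEmpty()) {
                columns.get(columns.size() - 1).add(t.term); // synonym: same column, new row
            } else {
                List<String> column = new ArrayList<>();     // regular token: start a new column
                column.add(t.term);
                columns.add(column);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Tok> input = List.of(new Tok("hello", 1), new Tok("world", 1), new Tok("earth", 0), new Tok("tellus", 0));
        System.out.println(toColumns(input));
    }
}
```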
AttributeSource.AttributeFactory, AttributeSource.State
Modifier and Type | Field and Description |
---|---|
static ShingleMatrixFilter.TokenSettingsCodec | defaultSettingsCodec: Deprecated. |
static Character | defaultSpacerCharacter: Deprecated. |
static boolean | ignoringSinglePrefixOrSuffixShingleByDefault: Deprecated. |
Constructor and Description |
---|
ShingleMatrixFilter(ShingleMatrixFilter.Matrix matrix, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec): Deprecated. Creates a shingle filter based on a user defined matrix. |
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize): Deprecated. Creates a shingle filter using default settings. |
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter): Deprecated. Creates a shingle filter using default settings. |
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle): Deprecated. Creates a shingle filter using the default ShingleMatrixFilter.TokenSettingsCodec. |
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec): Deprecated. Creates a shingle filter with ad hoc parameter settings. |
Modifier and Type | Method and Description |
---|---|
float | calculateShingleWeight(Token shingleToken, List<Token> shingle, int currentPermutationStartOffset, List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows, List<Token> currentPermuationTokens): Deprecated. Evaluates the new shingle token weight. |
ShingleMatrixFilter.Matrix | getMatrix(): Deprecated. |
int | getMaximumShingleSize(): Deprecated. |
int | getMinimumShingleSize(): Deprecated. |
Character | getSpacerCharacter(): Deprecated. |
boolean | incrementToken(): Deprecated. Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. |
boolean | isIgnoringSinglePrefixOrSuffixShingle(): Deprecated. |
void | reset(): Deprecated. Resets this stream to the beginning. |
void | setIgnoringSinglePrefixOrSuffixShingle(boolean ignoringSinglePrefixOrSuffixShingle): Deprecated. |
void | setMatrix(ShingleMatrixFilter.Matrix matrix): Deprecated. |
void | setMaximumShingleSize(int maximumShingleSize): Deprecated. |
void | setMinimumShingleSize(int minimumShingleSize): Deprecated. |
void | setSpacerCharacter(Character spacerCharacter): Deprecated. |
void | updateToken(Token token, List<Token> shingle, int currentPermutationStartOffset, List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows, List<Token> currentPermuationTokens): Deprecated. Final touch of a shingle token before it is passed on to the consumer from method incrementToken(). |
close, end
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
public static Character defaultSpacerCharacter
public static ShingleMatrixFilter.TokenSettingsCodec defaultSettingsCodec
public static boolean ignoringSinglePrefixOrSuffixShingleByDefault
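One way to picture the ignoringSinglePrefixOrSuffixShingle setting is the sketch below. It reflects an interpretation rather than the filter's code: it assumes one token per column, so "a shingle consisting only of the first or the last column" reduces to a single-token shingle at either end, which matters when boundary markers such as '^' and '$' are added. `BoundaryShingleSketch` and `shingles` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoundaryShingleSketch {

    /** Builds n-grams of minSize..maxSize tokens, optionally skipping a lone first or lone last token. */
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, boolean ignoreSingleBoundary) {
        List<String> result = new ArrayList<>();
        int last = tokens.size() - 1;
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                boolean onlyFirst = size == 1 && start == 0;
                boolean onlyLast = size == 1 && start == last;
                if (ignoreSingleBoundary && (onlyFirst || onlyLast)) {
                    continue; // a bare boundary marker carries no information of its own
                }
                result.add(String.join("_", tokens.subList(start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("^", "hello", "world", "$");
        System.out.println(shingles(tokens, 1, 2, true));
    }
}
```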
public ShingleMatrixFilter(ShingleMatrixFilter.Matrix matrix, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
matrix - the input basis for creating shingles. Does not need to contain any information until incrementToken() is called the first time.
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec - codec used to read input token weight and matrix positioning.
public ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize)
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
See Also: defaultSpacerCharacter, ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec
public ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter)
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
See Also: ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec
public ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle)
Creates a shingle filter using the default ShingleMatrixFilter.TokenSettingsCodec.
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
See Also: defaultSettingsCodec
public ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec - codec used to read input token weight and matrix positioning.
public void reset() throws IOException
Description copied from class: TokenStream
Resets this stream to the beginning. TokenStream.reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement TokenStream.reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after TokenStream.reset()).
Overrides: reset in class TokenStream
Throws: IOException
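The cloning advice for caching streams can be illustrated without Lucene. The sketch below is a simplified, hypothetical stand-in (`CloningCacheSketch` with a mutable `MutableToken`), not a TokenStream subclass, and it assumes the first pass is consumed completely before reset() is called. Tokens are copied when they are stored and again when they are replayed, so changes made by a consumer never leak into the cache.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class CloningCacheSketch {

    /** Hypothetical mutable token; consumers are free to modify it after receiving it. */
    static final class MutableToken {
        String term;

        MutableToken(String term) {
            this.term = term;
        }

        MutableToken copy() {
            return new MutableToken(term);
        }
    }

    private final Iterator<MutableToken> input;
    private final List<MutableToken> cache = new ArrayList<>();
    private int replayIndex = -1; // -1 while the first pass is still reading from the input

    CloningCacheSketch(Iterator<MutableToken> input) {
        this.input = input;
    }

    /** Returns the next token, or null when the stream is exhausted. */
    MutableToken next() {
        if (replayIndex < 0) {
            if (!input.hasNext()) {
                return null;
            }
            MutableToken token = input.next();
            cache.add(token.copy()); // clone when storing on the first pass
            return token;
        }
        if (replayIndex >= cache.size()) {
            return null;
        }
        return cache.get(replayIndex++).copy(); // clone again when replaying
    }

    /** Rewinds to the beginning; later passes are served from the cache. */
    void reset() {
        replayIndex = 0;
    }
}
```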
public final boolean incrementToken() throws IOException
Description copied from class: TokenStream
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().
Specified by: incrementToken in class TokenStream
Throws: IOException
public void updateToken(Token token, List<Token> shingle, int currentPermutationStartOffset, List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows, List<Token> currentPermuationTokens)
Final touch of a shingle token before it is passed on to the consumer from method incrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.
token - Shingle token
shingle - Tokens used to produce the shingle token.
currentPermutationStartOffset - Start offset in parameter currentPermutationTokens
currentPermutationRows - index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens
currentPermuationTokens - tokens of the current permutation of rows in the matrix.
public float calculateShingleWeight(Token shingleToken, List<Token> shingle, int currentPermutationStartOffset, List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows, List<Token> currentPermuationTokens)
shingleToken - token returned to consumer
shingle - the tokens used to produce the shingle token.
currentPermutationStartOffset - start offset in parameter currentPermutationRows and currentPermutationTokens.
currentPermutationRows - an index of the matrix row in which each token in parameter currentPermutationTokens exists.
currentPermuationTokens - all tokens in the current row permutation of the matrix. A sub list (parameter offset, parameter shingle.size) equals parameter shingle.
public int getMinimumShingleSize()
public void setMinimumShingleSize(int minimumShingleSize)
public int getMaximumShingleSize()
public void setMaximumShingleSize(int maximumShingleSize)
public ShingleMatrixFilter.Matrix getMatrix()
public void setMatrix(ShingleMatrixFilter.Matrix matrix)
public Character getSpacerCharacter()
public void setSpacerCharacter(Character spacerCharacter)
public boolean isIgnoringSinglePrefixOrSuffixShingle()
public void setIgnoringSinglePrefixOrSuffixShingle(boolean ignoringSinglePrefixOrSuffixShingle)