org.apache.lucene.analysis.shingle
Class ShingleMatrixFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.shingle.ShingleMatrixFilter

public class ShingleMatrixFilter
extends TokenStream

A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
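
A minimal usage sketch (assuming the Lucene 2.9-era contrib-analyzers API and a plain whitespace tokenizer; not taken from this class's documentation) that shingles the example sentence:

  import java.io.StringReader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;
  import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  // ... inside a method that declares throws IOException:
  TokenStream shingles = new ShingleMatrixFilter(
      new WhitespaceTokenizer(new StringReader("please divide this sentence into shingles")),
      2,   // minimumShingleSize
      2);  // maximumShingleSize
  TermAttribute term = (TermAttribute) shingles.addAttribute(TermAttribute.class);
  while (shingles.incrementToken()) {
    // With the default spacer character this prints shingles such as "please_divide".
    System.out.println(term.term());
  }
  shingles.close();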

Using a shingle filter at both index and query time can in some instances replace phrase queries, especially those with 0 slop.

Without a spacer character it can be used to handle composition and decomposition of words, for example when users search for "multi dimensional" instead of "multidimensional". This is a rather common problem at query time in several languages, notably in the northern Germanic branch.
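
For example (a sketch under the same assumptions as the snippet above), passing null as the spacer character glues the shingle parts together, so the two tokens of "multi dimensional" can be matched against the single word form:

  // Sketch: no spacer character, so the two parts are concatenated directly.
  TokenStream glued = new ShingleMatrixFilter(
      new WhitespaceTokenizer(new StringReader("multi dimensional")),
      2,     // minimumShingleSize
      2,     // maximumShingleSize
      null); // spacerCharacter; null means no spacer between the parts
  TermAttribute gluedTerm = (TermAttribute) glued.addAttribute(TermAttribute.class);
  while (glued.incrementToken()) {
    System.out.println(gluedTerm.term()); // "multidimensional" is expected among the output
  }
  glued.close();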

Among many other things, shingles are also known to help solve problems in spell checking, language detection and document clustering.

This filter is backed by a three-dimensional, column-oriented matrix used to create permutations of the second dimension, the rows, leaving the third dimension, the z-axis, for multi-token synonyms.

In order to use this filter you need to define a way of positioning the input stream tokens in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec. There are three simple implementations for demonstration purposes; see ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.
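
A sketch of how a codec is supplied (the parameter values are illustrative, and it is assumed that the nested codec classes can be instantiated directly, as the default settings codec field suggests):

  // Sketch: this codec treats tokens with a 0 position increment (e.g. injected
  // synonyms) as new rows in the current column, yielding matrices like the one below.
  TokenStream input = new WhitespaceTokenizer(new StringReader("hello world")); // stand-in stream
  TokenStream synShingles = new ShingleMatrixFilter(
      input,
      2,                  // minimumShingleSize
      3,                  // maximumShingleSize
      new Character('_'), // spacerCharacter
      false,              // ignoringSinglePrefixOrSuffixShingle
      new ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec());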

Consider this token matrix:

  Token[column][row][z-axis]{
    {{hello}, {greetings, and, salutations}},
    {{world}, {earth}, {tellus}}
  };
 
With a minimum shingle size of 2 and a maximum of 3, it would produce the following shingles:
 "hello_world"
 "greetings_and"
 "greetings_and_salutations"
 "and_salutations"
 "and_salutations_world"
 "salutations_world"
 "hello_earth"
 "and_salutations_earth"
 "salutations_earth"
 "hello_tellus"
 "and_salutations_tellus"
 "salutations_tellus"
  

This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is large and the stream contains many columns, or if each column contains many rows.

The problem is that in order to avoid producing duplicates, the filter needs to keep track of every shingle it has already produced and returned to the consumer. There is a bit of resource management in place to handle this, but it would of course be much better if the filter were written so that it never created the same shingle more than once in the first place.

The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See calculateShingleWeight(org.apache.lucene.analysis.Token, java.util.List, int, java.util.List, java.util.List).

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.


Nested Class Summary
static class ShingleMatrixFilter.Matrix
          A column-focused matrix in three dimensions.
static class ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec
          Using this codec makes a ShingleMatrixFilter act like ShingleFilter.
static class ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec
          A full-featured codec, not intended for serious use.
static class ShingleMatrixFilter.TokenPositioner
          Used to describe how a Token is to be inserted to a ShingleMatrixFilter.Matrix.
static class ShingleMatrixFilter.TokenSettingsCodec
          Strategy used to code and decode meta data of the tokens from the input stream, regarding how to position the tokens in the matrix, and how to set and retrieve weight, etc.
static class ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec
          A codec that creates a two dimensional matrix by treating tokens from the input stream with 0 position increment as new rows to the current column.
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
static ShingleMatrixFilter.TokenSettingsCodec defaultSettingsCodec
           
static Character defaultSpacerCharacter
           
static boolean ignoringSinglePrefixOrSuffixShingleByDefault
           
 
Constructor Summary
ShingleMatrixFilter(ShingleMatrixFilter.Matrix matrix, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
          Creates a shingle filter based on a user defined matrix.
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize)
          Creates a shingle filter using default settings.
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter)
          Creates a shingle filter using default settings.
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle)
          Creates a shingle filter using the default ShingleMatrixFilter.TokenSettingsCodec.
ShingleMatrixFilter(TokenStream input, int minimumShingleSize, int maximumShingleSize, Character spacerCharacter, boolean ignoringSinglePrefixOrSuffixShingle, ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
          Creates a shingle filter with ad hoc parameter settings.
 
Method Summary
 float calculateShingleWeight(Token shingleToken, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens)
          Evaluates the new shingle token weight.
 ShingleMatrixFilter.Matrix getMatrix()
           
 int getMaximumShingleSize()
           
 int getMinimumShingleSize()
           
 Character getSpacerCharacter()
           
 boolean incrementToken()
          Consumers (i.e., IndexWriter) use this method to advance the stream to the next token.
 boolean isIgnoringSinglePrefixOrSuffixShingle()
           
 Token next()
          Deprecated. Will be removed in Lucene 3.0. This method is final, as it should not be overridden. Delegates to the backwards compatibility layer.
 Token next(Token reusableToken)
          Deprecated. Will be removed in Lucene 3.0. This method is final, as it should not be overridden. Delegates to the backwards compatibility layer.
 void reset()
          Resets this stream to the beginning.
 void setIgnoringSinglePrefixOrSuffixShingle(boolean ignoringSinglePrefixOrSuffixShingle)
           
 void setMatrix(ShingleMatrixFilter.Matrix matrix)
           
 void setMaximumShingleSize(int maximumShingleSize)
           
 void setMinimumShingleSize(int minimumShingleSize)
           
 void setSpacerCharacter(Character spacerCharacter)
           
 void updateToken(Token token, List shingle, int currentPermutationStartOffset, List currentPermutationRows, List currentPermuationTokens)
          Final touch of a shingle token before it is passed on to the consumer from method next(org.apache.lucene.analysis.Token).
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
close, end, getOnlyUseNewAPI, setOnlyUseNewAPI
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

defaultSpacerCharacter

public static Character defaultSpacerCharacter

defaultSettingsCodec

public static ShingleMatrixFilter.TokenSettingsCodec defaultSettingsCodec

ignoringSinglePrefixOrSuffixShingleByDefault

public static boolean ignoringSinglePrefixOrSuffixShingleByDefault
Constructor Detail

ShingleMatrixFilter

public ShingleMatrixFilter(ShingleMatrixFilter.Matrix matrix,
                           int minimumShingleSize,
                           int maximumShingleSize,
                           Character spacerCharacter,
                           boolean ignoringSinglePrefixOrSuffixShingle,
                           ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
Creates a shingle filter based on a user defined matrix. Note that the filter will delete columns from the input matrix, and that you will not be able to reset the filter if you used this constructor. TODO: don't touch the matrix; use a boolean flag or set the input stream to null instead, and keep track of where in the matrix we are.

Parameters:
matrix - the matrix used as the basis for creating shingles. Does not need to contain any information until next(org.apache.lucene.analysis.Token) is called the first time.
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec - codec used to read input token weight and matrix positioning.

ShingleMatrixFilter

public ShingleMatrixFilter(TokenStream input,
                           int minimumShingleSize,
                           int maximumShingleSize)
Creates a shingle filter using default settings.

Parameters:
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
See Also:
defaultSpacerCharacter, ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec

ShingleMatrixFilter

public ShingleMatrixFilter(TokenStream input,
                           int minimumShingleSize,
                           int maximumShingleSize,
                           Character spacerCharacter)
Creates a shingle filter using default settings.

Parameters:
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
See Also:
ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec

ShingleMatrixFilter

public ShingleMatrixFilter(TokenStream input,
                           int minimumShingleSize,
                           int maximumShingleSize,
                           Character spacerCharacter,
                           boolean ignoringSinglePrefixOrSuffixShingle)
Creates a shingle filter using the default ShingleMatrixFilter.TokenSettingsCodec.

Parameters:
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
See Also:
defaultSettingsCodec

ShingleMatrixFilter

public ShingleMatrixFilter(TokenStream input,
                           int minimumShingleSize,
                           int maximumShingleSize,
                           Character spacerCharacter,
                           boolean ignoringSinglePrefixOrSuffixShingle,
                           ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
Creates a shingle filter with ad hoc parameter settings.

Parameters:
input - stream from which to construct the matrix
minimumShingleSize - minimum number of tokens in any shingle.
maximumShingleSize - maximum number of tokens in any shingle.
spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle - if true, shingles that contain only a permutation of the first or the last column will not be produced. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec - codec used to read input token weight and matrix positioning.
Method Detail

reset

public void reset()
           throws IOException
Description copied from class: TokenStream
Resets this stream to the beginning. This is an optional operation, so subclasses may or may not implement this method. TokenStream.reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement TokenStream.reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after TokenStream.reset()).

Overrides:
reset in class TokenStream
Throws:
IOException

incrementToken

public final boolean incrementToken()
                             throws IOException
Description copied from class: TokenStream
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.

This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class) or downcasts, references to all AttributeImpls that this stream uses should be retrieved during instantiation.

To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().

Overrides:
incrementToken in class TokenStream
Returns:
false for end of stream; true otherwise

Note that this method will be defined abstract in Lucene 3.0.

Throws:
IOException
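
A minimal consumer sketch following the guidance above (assuming the Lucene 2.9-era attribute classes TermAttribute and OffsetAttribute from org.apache.lucene.analysis.tokenattributes, inside a method that declares throws IOException): the attribute references are retrieved once, before the loop, not per token.

  TokenStream stream = new ShingleMatrixFilter(
      new WhitespaceTokenizer(new StringReader("hello world")), 2, 2);
  // Retrieve attribute references once, during setup, rather than inside the loop.
  TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class);
  OffsetAttribute offsets = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
  while (stream.incrementToken()) { // returns false at end of stream
    System.out.println(term.term() + " " + offsets.startOffset() + "-" + offsets.endOffset());
  }
  stream.end();
  stream.close();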

next

public final Token next(Token reusableToken)
                 throws IOException
Deprecated. Will be removed in Lucene 3.0. This method is final, as it should not be overridden. Delegates to the backwards compatibility layer.

Description copied from class: TokenStream
Returns the next token in the stream, or null at EOS. When possible, the input Token should be used as the returned Token (this gives fastest tokenization performance), but this is not required and a new Token may be returned. Callers may re-use a single Token instance for successive calls to this method.

This implicitly defines a "contract" between consumers (callers of this method) and producers (implementations of this method that are the source for tokens):

 A consumer must fully consume the previously returned Token before calling this method again.
 A producer must call Token.clear() before setting the fields in it and returning it.

Also, the producer must make no assumptions about a Token after it has been returned: the caller may arbitrarily change it. If the producer needs to hold onto the Token for subsequent calls, it must clone() it before storing it. Note that a TokenFilter is considered a consumer.

Overrides:
next in class TokenStream
Parameters:
reusableToken - a Token that may or may not be used as the returned Token; this parameter should never be null (the callee is not required to check for null before using it, but it is a good idea to assert that it is not null).
Returns:
next Token in the stream or null if end-of-stream was hit
Throws:
IOException

next

public final Token next()
                 throws IOException
Deprecated. Will be removed in Lucene 3.0. This method is final, as it should not be overridden. Delegates to the backwards compatibility layer.

Description copied from class: TokenStream
Returns the next Token in the stream, or null at EOS.

Overrides:
next in class TokenStream
Throws:
IOException

updateToken

public void updateToken(Token token,
                        List shingle,
                        int currentPermutationStartOffset,
                        List currentPermutationRows,
                        List currentPermuationTokens)
Applies the final touches to a shingle token before it is passed on to the consumer from method next(org.apache.lucene.analysis.Token). Calculates and sets type, flags, position increment, start/end offsets and weight.

Parameters:
token - Shingle token
shingle - Tokens used to produce the shingle token.
currentPermutationStartOffset - Start offset in parameter currentPermutationTokens
currentPermutationRows - indices of the Matrix.Column.Row corresponding to the positions of the tokens in parameter currentPermuationTokens
currentPermuationTokens - tokens of the current permutation of rows in the matrix.

calculateShingleWeight

public float calculateShingleWeight(Token shingleToken,
                                    List shingle,
                                    int currentPermutationStartOffset,
                                    List currentPermutationRows,
                                    List currentPermuationTokens)
Evaluates the new shingle token weight:

  for (shingle part token in shingle)
    weight += shingle part token weight * (1 / sqrt(sum of all shingle part token weights))

This algorithm gives a slightly greater score to longer shingles and rather penalises large individual shingle token part weights.

Parameters:
shingleToken - token returned to consumer
shingle - the tokens used to produce the shingle token.
currentPermutationStartOffset - start offset in parameter currentPermutationRows and currentPermutationTokens.
currentPermutationRows - an index of which matrix row each token in parameter currentPermuationTokens belongs to.
currentPermuationTokens - all tokens in the current row permutation of the matrix. The sub list starting at currentPermutationStartOffset and of length shingle.size() equals parameter shingle.
Returns:
weight to be set for parameter shingleToken
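
A sketch of the formula above in isolation (the partWeights array is hypothetical and merely stands in for the weights the settings codec reads from the shingle part tokens):

  // weight = sum over parts of: partWeight * (1 / sqrt(sum of all partWeights))
  float[] partWeights = {1.0f, 2.0f, 0.5f}; // hypothetical shingle part weights
  float sum = 0f;
  for (int i = 0; i < partWeights.length; i++) {
    sum += partWeights[i];
  }
  float weight = 0f;
  for (int i = 0; i < partWeights.length; i++) {
    weight += partWeights[i] * (1f / (float) Math.sqrt(sum));
  }
  // Per the formula, weight works out to sum / sqrt(sum) = sqrt(sum), so large
  // individual part weights are dampened while longer shingles score slightly higher.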

getMinimumShingleSize

public int getMinimumShingleSize()

setMinimumShingleSize

public void setMinimumShingleSize(int minimumShingleSize)

getMaximumShingleSize

public int getMaximumShingleSize()

setMaximumShingleSize

public void setMaximumShingleSize(int maximumShingleSize)

getMatrix

public ShingleMatrixFilter.Matrix getMatrix()

setMatrix

public void setMatrix(ShingleMatrixFilter.Matrix matrix)

getSpacerCharacter

public Character getSpacerCharacter()

setSpacerCharacter

public void setSpacerCharacter(Character spacerCharacter)

isIgnoringSinglePrefixOrSuffixShingle

public boolean isIgnoringSinglePrefixOrSuffixShingle()

setIgnoringSinglePrefixOrSuffixShingle

public void setIgnoringSinglePrefixOrSuffixShingle(boolean ignoringSinglePrefixOrSuffixShingle)


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.