org.apache.lucene.analysis.compound
Class DictionaryCompoundWordTokenFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
                  extended by org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
All Implemented Interfaces:
Closeable

public class DictionaryCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" becomes "Donau", "dampf", "schiff", so that a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm to achieve this: it looks up candidate substrings of each token in the dictionary.
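
The brute-force idea can be sketched in plain Java. This is a minimal illustration, not the Lucene implementation: the class name BruteForceDecompounder, the use of java.util.Set in place of a Lucene dictionary, and the inclusive size bounds are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative stand-in for the filter's lookup loop; not a Lucene class. */
final class BruteForceDecompounder {

    /**
     * Returns every dictionary entry that occurs as a substring of the word,
     * ordered by start position. The dictionary is assumed to hold lowercase entries.
     */
    static List<String> decompose(String word, Set<String> lowercaseDictionary,
                                  int minSubwordSize, int maxSubwordSize) {
        String term = word.toLowerCase(Locale.ROOT);
        List<String> subwords = new ArrayList<>();
        // Try every start position ...
        for (int start = 0; start + minSubwordSize <= term.length(); start++) {
            // ... and every permitted length from that position.
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                if (start + len > term.length()) {
                    break;
                }
                String candidate = term.substring(start, start + len);
                if (lowercaseDictionary.contains(candidate)) {
                    subwords.add(candidate);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        System.out.println(decompose("Donaudampfschiff", dictionary, 2, 15));
        // prints [donau, dampf, schiff]
    }
}
```

Every start position is combined with every allowed length, so the number of dictionary lookups grows with the token length times the subword size range.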

You must specify the required Version compatibility when creating this filter. See CompoundWordTokenFilterBase for details.

If you pass in a CharArraySet as the dictionary, it should be case-insensitive, unless it contains only lowercased entries and a LowerCaseFilter precedes this filter in your analysis chain. For optimal performance, use the latter combination (a lowercased CharArraySet plus LowerCaseFilter), because this filter performs many dictionary lookups. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive sets.
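
The two dictionary setups can be compared with ordinary java.util collections. This is only an analogy: TreeSet with String.CASE_INSENSITIVE_ORDER stands in for a case-insensitive CharArraySet, a HashSet of lowercased entries stands in for a case-sensitive one, and calling toLowerCase on the lookup key plays the role of LowerCaseFilter. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative stand-in for the two dictionary configurations; not Lucene code. */
final class DictionaryCaseSketch {

    /** Option A: a case-insensitive set, so mixed-case tokens match directly. */
    static Set<String> caseInsensitive(Set<String> words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(words);
        return set;
    }

    /** Option B: a case-sensitive set of lowercased entries; tokens must be lowercased first. */
    static Set<String> lowercasedOnly(Set<String> words) {
        Set<String> set = new HashSet<>();
        for (String word : words) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> words = Set.of("Donau", "Dampf", "Schiff");
        System.out.println(caseInsensitive(words).contains("SCHIFF")); // true
        System.out.println(lowercasedOnly(words).contains("Schiff"));  // false: token was not lowercased
        System.out.println(lowercasedOnly(words).contains("schiff"));  // true
    }
}
```

Option B misses mixed-case tokens unless something earlier in the chain lowercases them, which is why the lowercased dictionary is paired with LowerCaseFilter.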


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, Set dictionary)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead.
DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead.
DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, String[] dictionary)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead.
DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead.
DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, Set<?> dictionary)
          Creates a new DictionaryCompoundWordTokenFilter
DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new DictionaryCompoundWordTokenFilter
DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, String[] dictionary)
          Deprecated. Use the constructors taking Set
DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use the constructors taking Set
 
Method Summary
protected  void decompose()
          Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
 
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, makeDictionary, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                    String[] dictionary,
                                                    int minWordSize,
                                                    int minSubwordSize,
                                                    int maxSubwordSize,
                                                    boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead.

Creates a new DictionaryCompoundWordTokenFilter.

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                    String[] dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead.

Creates a new DictionaryCompoundWordTokenFilter

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                    Set dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead.

Creates a new DictionaryCompoundWordTokenFilter

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                    Set dictionary,
                                                    int minWordSize,
                                                    int minSubwordSize,
                                                    int maxSubwordSize,
                                                    boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead.

Creates a new DictionaryCompoundWordTokenFilter

Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                    org.apache.lucene.analysis.TokenStream input,
                                                    String[] dictionary,
                                                    int minWordSize,
                                                    int minSubwordSize,
                                                    int maxSubwordSize,
                                                    boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set

Creates a new DictionaryCompoundWordTokenFilter

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

DictionaryCompoundWordTokenFilter

@Deprecated
public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                    org.apache.lucene.analysis.TokenStream input,
                                                    String[] dictionary)
Deprecated. Use the constructors taking Set

Creates a new DictionaryCompoundWordTokenFilter

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against

DictionaryCompoundWordTokenFilter

public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.TokenStream input,
                                         Set<?> dictionary)
Creates a new DictionaryCompoundWordTokenFilter

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against.
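
The matchVersion note about Unicode 4.0 behavior most likely concerns supplementary characters, which occupy two char values in a Java String. The sketch below uses only java.lang and shows why lowercasing one char at a time differs from lowercasing one code point at a time. It illustrates the general issue and is not a copy of the dictionary code; the class and method names are hypothetical.

```java
/** Illustrates per-char versus per-code-point lowercasing; not Lucene code. */
final class SupplementaryCaseSketch {

    /** Lowercases one UTF-16 code unit at a time; surrogate pairs pass through unchanged. */
    static String lowerByChar(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        return new String(chars);
    }

    /** Lowercases one code point at a time, so supplementary characters are handled. */
    static String lowerByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(codePoint));
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I lies outside the Basic Multilingual Plane.
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(lowerByChar(upper).equals(lower));      // false
        System.out.println(lowerByCodePoint(upper).equals(lower)); // true
    }
}
```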

DictionaryCompoundWordTokenFilter

public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                         org.apache.lucene.analysis.TokenStream input,
                                         Set<?> dictionary,
                                         int minWordSize,
                                         int minSubwordSize,
                                         int maxSubwordSize,
                                         boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
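
The effect of the four tuning parameters can be sketched in plain Java. The class SubwordParameterSketch is hypothetical and does not use Lucene. It treats all size limits as inclusive and applies onlyLongestMatch per start position; both are assumptions of this sketch, so check the filter's actual boundary behavior before relying on exact values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrates minWordSize, minSubwordSize, maxSubwordSize and onlyLongestMatch; not Lucene code. */
final class SubwordParameterSketch {

    static List<String> decompose(String term, Set<String> dictionary, int minWordSize,
                                  int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        if (term.length() < minWordSize) {
            return subwords; // word too short: no decomposition is attempted
        }
        for (int start = 0; start + minSubwordSize <= term.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= term.length(); len++) {
                String candidate = term.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // lengths increase, so the last hit is the longest
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("fuss", "fussball", "ball", "pumpe");
        System.out.println(decompose("fussballpumpe", dictionary, 5, 2, 15, false));
        // prints [fuss, fussball, ball, pumpe]
        System.out.println(decompose("fussballpumpe", dictionary, 5, 2, 15, true));
        // prints [fussball, ball, pumpe]
    }
}
```

With onlyLongestMatch set, "fuss" is dropped because "fussball" starts at the same position, while "ball" survives because it starts at a different one.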

Method Detail

decompose

protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. Do not place the original token in the list; it is passed through this filter automatically.

Specified by:
decompose in class CompoundWordTokenFilterBase


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.