Class DictionaryCompoundWordTokenFilter

  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
                  extended by org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
All Implemented Interfaces:

public class DictionaryCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

For example, "Donaudampfschiff" yields the subwords Donau, dampf, and schiff, so that a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm: it tests substrings of each token against the supplied dictionary.

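The brute-force decomposition described above can be pictured with a small stand-alone sketch. It does not use Lucene: the class BruteForceDecompounder and its decompose method are hypothetical names introduced only for illustration. The sketch treats the size limits as inclusive bounds and matches against a lowercased copy of the word; both are assumptions of this example, not statements about the library's exact behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in for the decomposition step; this is not Lucene code.
final class BruteForceDecompounder {

    // Returns every substring of `word` whose lowercase form is in `dictionary`
    // and whose length lies in [minSubwordSize, maxSubwordSize] (inclusive bounds
    // are an assumption of this sketch).
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize, int maxSubwordSize) {
        List<String> subwords = new ArrayList<String>();
        if (word.length() < minWordSize) {
            return subwords; // too short to be treated as a compound
        }
        // Assumes lowercasing keeps the length unchanged (true for the ASCII input used here).
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                if (start + len > word.length()) {
                    break; // candidate would run past the end of the word
                }
                if (dictionary.contains(lower.substring(start, start + len))) {
                    // Emit the subword with the casing of the original word.
                    subwords.add(word.substring(start, start + len));
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<String>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 5, 2, 15));
    }
}
```

Every start offset is combined with every permitted length, which is why the approach is called brute force: the cost grows with the word length times the range of subword sizes, but no linguistic knowledge beyond the dictionary is needed.
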
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
Field Summary
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, onlyLongestMatch, tokens
Fields inherited from class org.apache.lucene.analysis.TokenFilter
Constructor Summary
DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead.
DictionaryCompoundWordTokenFilter(TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead.
DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead.
DictionaryCompoundWordTokenFilter(TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead.
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set dictionary)
          Creates a new DictionaryCompoundWordTokenFilter
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new DictionaryCompoundWordTokenFilter
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary)
          Creates a new DictionaryCompoundWordTokenFilter
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new DictionaryCompoundWordTokenFilter
Method Summary
protected void decomposeInternal(Token token)
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
addAllLowerCase, createToken, decompose, incrementToken, makeDictionary, makeDictionary, makeLowerCaseCopy, reset
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Detail


public DictionaryCompoundWordTokenFilter(TokenStream input,
                                         String[] dictionary,
                                         int minWordSize,
                                         int minSubwordSize,
                                         int maxSubwordSize,
                                         boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[], int, int, int, boolean) instead.

Creates a new DictionaryCompoundWordTokenFilter

input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream


public DictionaryCompoundWordTokenFilter(TokenStream input,
                                         String[] dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, String[]) instead.

Creates a new DictionaryCompoundWordTokenFilter

input - the TokenStream to process
dictionary - the word dictionary to match against


public DictionaryCompoundWordTokenFilter(TokenStream input,
                                         Set dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set) instead.

Creates a new DictionaryCompoundWordTokenFilter

input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.


public DictionaryCompoundWordTokenFilter(TokenStream input,
                                         Set dictionary,
                                         int minWordSize,
                                         int minSubwordSize,
                                         int maxSubwordSize,
                                         boolean onlyLongestMatch)
Deprecated. Use DictionaryCompoundWordTokenFilter(Version, TokenStream, Set, int, int, int, boolean) instead.

Creates a new DictionaryCompoundWordTokenFilter

input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream


public DictionaryCompoundWordTokenFilter(Version matchVersion,
                                         TokenStream input,
                                         String[] dictionary,
                                         int minWordSize,
                                         int minSubwordSize,
                                         int maxSubwordSize,
                                         boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter

matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream


public DictionaryCompoundWordTokenFilter(Version matchVersion,
                                         TokenStream input,
                                         String[] dictionary)
Creates a new DictionaryCompoundWordTokenFilter

matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against


public DictionaryCompoundWordTokenFilter(Version matchVersion,
                                         TokenStream input,
                                         Set dictionary)
Creates a new DictionaryCompoundWordTokenFilter

matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.


public DictionaryCompoundWordTokenFilter(Version matchVersion,
                                         TokenStream input,
                                         Set dictionary,
                                         int minWordSize,
                                         int minSubwordSize,
                                         int maxSubwordSize,
                                         boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter

matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
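
The effect of onlyLongestMatch is easiest to see when the dictionary contains overlapping entries. The following stand-alone sketch does not use Lucene; LongestMatchSketch and its decompose method are hypothetical names. It assumes that "longest" is decided separately for each start position in the word and that the subword size limits are inclusive. Both points are assumptions of this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the onlyLongestMatch flag; this is not Lucene code.
final class LongestMatchSketch {

    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> out = new ArrayList<String>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length(); len++) {
                if (!dictionary.contains(lower.substring(start, start + len))) {
                    continue;
                }
                String match = word.substring(start, start + len);
                if (onlyLongestMatch) {
                    longest = match; // len only grows, so the last match is the longest
                } else {
                    out.add(match); // keep every dictionary hit
                }
            }
            if (longest != null) {
                out.add(longest); // one subword per start position at most
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("dampf", "dampfschiff", "schiff"));
        // prints [Dampf, Dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", dict, 2, 15, false));
        // prints [Dampfschiff, schiff]
        System.out.println(decompose("Dampfschiff", dict, 2, 15, true));
    }
}
```

With the flag set, the shorter entry "dampf" is suppressed because a longer entry starts at the same position, while "schiff" survives because it starts elsewhere in the word.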
Method Detail


protected void decomposeInternal(Token token)
Specified by:
decomposeInternal in class CompoundWordTokenFilterBase

Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.