org.apache.lucene.analysis.compound
Class DictionaryCompoundWordTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
- All Implemented Interfaces:
- Closeable
public class DictionaryCompoundWordTokenFilter
- extends CompoundWordTokenFilterBase
A TokenFilter
that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff".
It uses a brute-force algorithm to achieve this.
You must specify the required Version
compatibility when creating
CompoundWordTokenFilterBase:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0
supplementary characters in strings and char arrays provided as compound word
dictionaries.
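The brute-force approach mentioned above can be illustrated with a minimal, standalone sketch. It is not the Lucene implementation: the class name `BruteForceDecompounder` and its `decompose` method are hypothetical, a plain `Set<String>` stands in for `CharArraySet`, and the dictionary entries are assumed to be lowercase. Every substring of the word is simply looked up in the dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class BruteForceDecompounder {

    // Returns every substring of `word` that appears in `dictionary`.
    // Assumption: dictionary entries are lowercase and lowercasing the
    // word does not change its length (true for the ASCII example below).
    public static List<String> decompose(String word, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                if (dictionary.contains(lower.substring(start, end))) {
                    // Keep the original casing of the matched part.
                    parts.add(word.substring(start, end));
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary));
    }
}
```

The nested loops try a number of substrings that grows quadratically with the word length, which is what makes the approach brute force; it stays practical because individual tokens are short.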
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, matchVersion, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
CharArraySet dictionary)
- Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion
- Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input
- the TokenStream to process
dictionary
- the word dictionary to match against.
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
CharArraySet dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
- Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
matchVersion
- Lucene version to enable correct Unicode 4.0 behavior in the
dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input
- the TokenStream to process
dictionary
- the word dictionary to match against.
minWordSize
- only words longer than this get processed
minSubwordSize
- only subwords longer than this get to the output stream
maxSubwordSize
- only subwords shorter than this get to the output stream
onlyLongestMatch
- Add only the longest matching subword to the stream
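To make the effect of the size limits and of onlyLongestMatch concrete, here is a standalone sketch under stated assumptions. It is not the Lucene code: `BoundedDecompounder` is a hypothetical class, a `Set<String>` of lowercase entries replaces `CharArraySet`, the size limits are treated as inclusive bounds, and "longest" is decided separately for each start position. The numeric values in `main` are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class BoundedDecompounder {

    // Assumptions: sizes are positive and inclusive, dictionary entries
    // are lowercase, and lowercasing does not change the word length.
    public static List<String> decompose(String word, Set<String> dictionary,
                                         int minWordSize, int minSubwordSize,
                                         int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> parts = new ArrayList<>();
        if (word.length() < minWordSize) {
            return parts; // word too short to be decomposed
        }
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length(); len++) {
                if (!dictionary.contains(lower.substring(start, start + len))) {
                    continue;
                }
                String match = word.substring(start, start + len);
                if (onlyLongestMatch) {
                    longest = match; // len only grows, so the last match is the longest
                } else {
                    parts.add(match);
                }
            }
            if (longest != null) {
                parts.add(longest);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("fuss", "fussball", "ball", "pumpe"));
        // prints [Fuss, Fussball, ball, pumpe]
        System.out.println(decompose("Fussballpumpe", dictionary, 5, 2, 15, false));
        // prints [Fussball, ball, pumpe]
        System.out.println(decompose("Fussballpumpe", dictionary, 5, 2, 15, true));
    }
}
```

With onlyLongestMatch enabled, "Fuss" is dropped because "Fussball" starts at the same position and is longer, while "ball" survives because it starts at a different position.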
decompose
protected void decompose()
- Description copied from class:
CompoundWordTokenFilterBase
- Decomposes the current
CompoundWordTokenFilterBase.termAtt
and places CompoundWordTokenFilterBase.CompoundToken
instances in the CompoundWordTokenFilterBase.tokens
list.
The original token should not be placed in the list, because it is passed through this filter automatically.
- Specified by:
decompose
in class CompoundWordTokenFilterBase
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.