public class DictionaryCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.
For example, "Donaudampfschiff" becomes Donau, dampf, schiff, so that a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm: candidate substrings of each token are looked up in the supplied dictionary.
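The brute-force idea can be illustrated without any Lucene classes. The following is a minimal sketch, not the filter's actual implementation: the class name BruteForceDecompounder and its decompose method are hypothetical, a plain java.util.Set stands in for CharArraySet, and case-insensitive matching against a lowercase dictionary is an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative brute-force decompounder (hypothetical; not Lucene code). */
final class BruteForceDecompounder {

    /**
     * Tries every substring of {@code word} and keeps those found in
     * {@code dictionary}. Assumption: the dictionary holds lowercase
     * entries, so each candidate is lowercased before the lookup.
     */
    static List<String> decompose(String word, Set<String> dictionary) {
        List<String> subwords = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                String candidate = word.substring(start, end);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    subwords.add(candidate); // keep the original casing
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary));
    }
}
```

Because every start and end position is tried, the cost grows quadratically with the token length, which is why the real filter exposes size limits on words and subwords.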
You may specify Version compatibility when creating a CompoundWordTokenFilterBase. The constructors that accept a Version argument are deprecated.
Nested classes inherited: CompoundWordTokenFilterBase.CompoundToken, AttributeSource.State
Fields inherited from CompoundWordTokenFilterBase: DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, matchVersion, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
Fields inherited from TokenFilter: input
Other inherited fields: DEFAULT_TOKEN_ATTRIBUTE_FACTORY, DEFAULT_ATTRIBUTE_FACTORY
Constructor and Description |
---|
DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary): Creates a new DictionaryCompoundWordTokenFilter. |
DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch): Creates a new DictionaryCompoundWordTokenFilter. |
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary): Deprecated. |
DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch): Deprecated. |
Modifier and Type | Method and Description |
---|---|
protected void | decompose(): Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. |
Methods inherited from CompoundWordTokenFilterBase: incrementToken, reset
Methods inherited from TokenFilter: close, end
Methods inherited from AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary)
Deprecated. Use DictionaryCompoundWordTokenFilter(TokenStream,CharArraySet) instead.
public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new DictionaryCompoundWordTokenFilter.
Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated public DictionaryCompoundWordTokenFilter(Version matchVersion, TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
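How the size parameters and onlyLongestMatch interact is easier to see in code. The sketch below is illustrative only and makes several assumptions that this page does not confirm: SizeLimitedDecompounder is a hypothetical class, a java.util.Set replaces CharArraySet, matching is case-sensitive, all size bounds are treated as inclusive, and "longest match" is decided separately for each start position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of size-limited decomposition (not Lucene code). */
final class SizeLimitedDecompounder {

    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        // Assumption: words shorter than minWordSize are not decomposed.
        if (word.length() < minWordSize) {
            return subwords;
        }
        for (int start = 0; start <= word.length() - minSubwordSize; start++) {
            String longest = null;
            // Assumption: subword lengths in [minSubwordSize, maxSubwordSize].
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                if (start + len > word.length()) {
                    break; // candidate would run past the end of the word
                }
                String candidate = word.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // len only grows, so the last hit is longest
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict =
                new HashSet<>(Arrays.asList("fuss", "fussball", "ball", "pumpe"));
        // prints [fuss, fussball, ball, pumpe]
        System.out.println(decompose("fussballpumpe", dict, 5, 2, 15, false));
        // prints [fussball, ball, pumpe]
        System.out.println(decompose("fussballpumpe", dict, 5, 2, 15, true));
    }
}
```

In this sketch "ball" still appears when onlyLongestMatch is true, because it is the longest match at its own start position; lowering maxSubwordSize to 5 would exclude "fussball" entirely.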
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token may not be placed in the list, as it is automatically passed through this filter.
Specified by: decompose in class CompoundWordTokenFilterBase
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.