org.apache.lucene.analysis.compound
Class CompoundWordTokenFilterBase
java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.TokenFilter
        org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
- All Implemented Interfaces:
- Closeable
- Direct Known Subclasses:
- DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter
public abstract class CompoundWordTokenFilterBase
- extends TokenFilter
Base class for decomposition token filters.
You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0
supplementary characters in strings and char arrays provided as compound word
dictionaries.
- As of 4.4, CompoundWordTokenFilterBase doesn't update offsets.
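To make the 4.4 note concrete, the sketch below contrasts the two offset behaviours for a subword found inside a token. It is plain Java and does not use Lucene: `SubwordOffsets` and both method names are hypothetical, and the pre-4.4 arithmetic is an assumption about what "updating offsets" meant, not a copy of the library code.

```java
/** Hypothetical helper, not part of Lucene: two ways to assign subword offsets. */
final class SubwordOffsets {
    private SubwordOffsets() {}

    /**
     * Offsets narrowed to the subword itself (assumed pre-4.4 behaviour).
     * subwordOffset is the subword's position inside the original token.
     */
    static int[] narrowed(int tokenStart, int subwordOffset, int subwordLength) {
        int start = tokenStart + subwordOffset;
        return new int[] {start, start + subwordLength};
    }

    /** Offsets taken unchanged from the original token (4.4 and later). */
    static int[] inherited(int tokenStart, int tokenEnd) {
        return new int[] {tokenStart, tokenEnd};
    }
}
```

Under these assumptions, a subword of length 4 starting 4 characters into a token that spans 10 to 23 would get offsets 14 to 18 when narrowed, and 10 to 23 when inherited.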
Constructor Summary
protected CompoundWordTokenFilterBase(Version matchVersion, TokenStream input, CharArraySet dictionary)
protected CompoundWordTokenFilterBase(Version matchVersion, TokenStream input, CharArraySet dictionary, boolean onlyLongestMatch)
protected CompoundWordTokenFilterBase(Version matchVersion, TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
DEFAULT_MIN_WORD_SIZE
public static final int DEFAULT_MIN_WORD_SIZE
- The default minimum length a word must have to be decomposed
- See Also:
- Constant Field Values
DEFAULT_MIN_SUBWORD_SIZE
public static final int DEFAULT_MIN_SUBWORD_SIZE
- The default minimum length of subwords that are propagated to the output of this filter
- See Also:
- Constant Field Values
DEFAULT_MAX_SUBWORD_SIZE
public static final int DEFAULT_MAX_SUBWORD_SIZE
- The default maximum length of subwords that are propagated to the output of this filter
- See Also:
- Constant Field Values
matchVersion
protected final Version matchVersion
dictionary
protected final CharArraySet dictionary
tokens
protected final LinkedList<CompoundWordTokenFilterBase.CompoundToken> tokens
minWordSize
protected final int minWordSize
minSubwordSize
protected final int minSubwordSize
maxSubwordSize
protected final int maxSubwordSize
onlyLongestMatch
protected final boolean onlyLongestMatch
termAtt
protected final CharTermAttribute termAtt
offsetAtt
protected final OffsetAttribute offsetAtt
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(Version matchVersion,
TokenStream input,
CharArraySet dictionary,
boolean onlyLongestMatch)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(Version matchVersion,
TokenStream input,
CharArraySet dictionary)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(Version matchVersion,
TokenStream input,
CharArraySet dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
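The constructor parameters are not described individually on this page. Going by the descriptions of the default constants, one plausible reading of how they interact in a dictionary-based decomposition is sketched below. This is a standalone illustration in plain Java, not Lucene's implementation: `DecompositionSketch`, its `subwords` method, and the use of `Set<String>` in place of `CharArraySet` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the size and longest-match parameters. */
final class DecompositionSketch {
    private DecompositionSketch() {}

    static List<String> subwords(String term, Set<String> dictionary,
                                 int minWordSize, int minSubwordSize,
                                 int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> found = new ArrayList<>();
        // Words shorter than minWordSize are not decomposed at all.
        if (term.length() < minWordSize) {
            return found;
        }
        // Try every start position that still leaves room for a subword.
        for (int start = 0; start <= term.length() - minSubwordSize; start++) {
            String longest = null;
            // Only candidate lengths within [minSubwordSize, maxSubwordSize].
            for (int len = minSubwordSize; len <= maxSubwordSize; len++) {
                if (start + len > term.length()) {
                    break;
                }
                String candidate = term.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    // len only grows, so a later match is always the longer one.
                    longest = candidate;
                } else {
                    found.add(candidate);
                }
            }
            if (longest != null) {
                found.add(longest);
            }
        }
        return found;
    }
}
```

In this sketch, "longest" is decided per start position. With a dictionary containing "fuss", "fussball", "ball" and "pumpe", the term "fussballpumpe" yields all four entries when `onlyLongestMatch` is false, and drops "fuss" in favour of "fussball" when it is true.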
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by:
incrementToken
in class TokenStream
- Throws:
IOException
decompose
protected abstract void decompose()
- Decomposes the current termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the tokens list.
The original token must not be placed in the list, because it is automatically passed through this filter.
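The contract described for decompose() can be pictured as a template method: the base class emits the original token itself and then drains whatever the subclass queued. The sketch below models that with plain strings instead of Lucene attributes. `PassThroughSketch`, `FixedSplitSketch`, and the `filter` method are hypothetical names for illustration, and the emission order shown (original first, then subwords) is an assumption rather than something this page states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of the pass-through contract, using plain strings. */
abstract class PassThroughSketch {
    /** Subwords queued by decompose(); drained after the original token. */
    protected final Deque<String> tokens = new ArrayDeque<>();

    /** Adds subwords of term to tokens; must not add term itself. */
    protected abstract void decompose(String term);

    final List<String> filter(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String term : input) {
            output.add(term);   // the original token always passes through
            decompose(term);    // the subclass queues subwords only
            while (!tokens.isEmpty()) {
                output.add(tokens.removeFirst());
            }
        }
        return output;
    }
}

/** Hypothetical subclass: splits one known compound into fixed subwords. */
final class FixedSplitSketch extends PassThroughSketch {
    @Override
    protected void decompose(String term) {
        if (term.equals("fussball")) {
            tokens.add("fuss");
            tokens.add("ball");
        }
    }
}
```

If a subclass also added the original term to the queue, the term would appear twice in the output, which is why the contract forbids it.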
reset
public void reset()
throws IOException
- Overrides:
reset
in class TokenFilter
- Throws:
IOException
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.