org.apache.lucene.analysis.compound
Class CompoundWordTokenFilterBase
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
- All Implemented Interfaces:
- Closeable
- Direct Known Subclasses:
- DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter
public abstract class CompoundWordTokenFilterBase
- extends org.apache.lucene.analysis.TokenFilter
Base class for decomposition token filters.
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Constructor Summary
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, Set dictionary)
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, Set dictionary, boolean onlyLongestMatch)
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, String[] dictionary)
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, String[] dictionary, boolean onlyLongestMatch)
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Method Summary
protected static void addAllLowerCase(Set target, Collection col)
protected org.apache.lucene.analysis.Token createToken(int offset, int length, org.apache.lucene.analysis.Token prototype)
protected void decompose(org.apache.lucene.analysis.Token token)
protected abstract void decomposeInternal(org.apache.lucene.analysis.Token token)
boolean incrementToken()
static Set makeDictionary(String[] dictionary)
Creates a set of words from an array. The resulting Set does case-insensitive matching. TODO: We should look for a faster dictionary lookup approach.
protected static char[] makeLowerCaseCopy(char[] buffer)
void reset()
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
DEFAULT_MIN_WORD_SIZE
public static final int DEFAULT_MIN_WORD_SIZE
- The default minimum length a word must have to be decomposed
- See Also:
- Constant Field Values
DEFAULT_MIN_SUBWORD_SIZE
public static final int DEFAULT_MIN_SUBWORD_SIZE
- The default minimum length of subwords that are propagated to the output of this filter
- See Also:
- Constant Field Values
DEFAULT_MAX_SUBWORD_SIZE
public static final int DEFAULT_MAX_SUBWORD_SIZE
- The default maximum length of subwords that are propagated to the output of this filter
- See Also:
- Constant Field Values
dictionary
protected final org.apache.lucene.analysis.CharArraySet dictionary
tokens
protected final LinkedList tokens
minWordSize
protected final int minWordSize
minSubwordSize
protected final int minSubwordSize
maxSubwordSize
protected final int maxSubwordSize
onlyLongestMatch
protected final boolean onlyLongestMatch
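The following is a minimal, self-contained sketch of how the size limits and the onlyLongestMatch flag documented above might interact in a dictionary-based decomposition. It is an illustration only, not the implementation of this class or its subclasses: the class name DecomposeSketch, the method decompose and the use of plain String and Set<String> (instead of Token and CharArraySet) are assumptions made for the example, as is the exact treatment of the boundary values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; does not use or reproduce the Lucene classes.
class DecomposeSketch {

    // Returns the dictionary subwords found in 'word'.
    // The dictionary is assumed to contain lowercased entries only.
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        // Assumption: words shorter than minWordSize are not decomposed.
        if (word.length() < minWordSize) {
            return subwords;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            String longest = null;
            // Try every allowed subword length at this start position.
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    // Lengths increase, so the last hit is the longest one.
                    longest = candidate;
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("rind", "fleisch", "suppe", "fleischsuppe"));
        System.out.println(decompose("Rindfleischsuppe", dictionary, 5, 2, 15, false));
        System.out.println(decompose("Rindfleischsuppe", dictionary, 5, 2, 15, true));
    }
}
```

In this sketch, onlyLongestMatch keeps at most one subword per start position, so overlapping shorter entries that begin at the same offset are dropped.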
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
String[] dictionary,
boolean onlyLongestMatch)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
Set dictionary,
boolean onlyLongestMatch)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
String[] dictionary)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
Set dictionary)
CompoundWordTokenFilterBase
protected CompoundWordTokenFilterBase(org.apache.lucene.analysis.TokenStream input,
Set dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
makeDictionary
public static final Set makeDictionary(String[] dictionary)
- Creates a set of words from an array. The resulting Set does case-insensitive matching.
TODO: We should look for a faster dictionary lookup approach.
- Parameters:
dictionary - the array of words
- Returns:
Set of lowercased terms
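The idea of a set of lowercased terms that supports case-insensitive matching can be sketched in plain Java as follows. This is an illustrative stand-in, not the code behind makeDictionary: the class DictionarySketch and its two methods are hypothetical names, and a java.util.HashSet of strings is used here in place of the CharArraySet that the dictionary field is declared as.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in; not the Lucene implementation.
class DictionarySketch {

    // Lowercase every word and collect the results into a set.
    static Set<String> makeLowerCaseDictionary(String[] words) {
        Set<String> dictionary = new HashSet<>();
        for (String word : words) {
            dictionary.add(word.toLowerCase(Locale.ROOT));
        }
        return dictionary;
    }

    // Lowercase the candidate before the lookup, so matching ignores case.
    static boolean containsIgnoreCase(Set<String> dictionary, String candidate) {
        return dictionary.contains(candidate.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                makeLowerCaseDictionary(new String[] {"Rind", "FLEISCH"});
        System.out.println(containsIgnoreCase(dictionary, "Fleisch")); // prints "true"
        System.out.println(containsIgnoreCase(dictionary, "Suppe"));   // prints "false"
    }
}
```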
incrementToken
public final boolean incrementToken()
throws IOException
- Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
addAllLowerCase
protected static final void addAllLowerCase(Set target,
Collection col)
makeLowerCaseCopy
protected static char[] makeLowerCaseCopy(char[] buffer)
createToken
protected final org.apache.lucene.analysis.Token createToken(int offset,
int length,
org.apache.lucene.analysis.Token prototype)
decompose
protected void decompose(org.apache.lucene.analysis.Token token)
decomposeInternal
protected abstract void decomposeInternal(org.apache.lucene.analysis.Token token)
reset
public void reset()
throws IOException
- Overrides:
reset in class org.apache.lucene.analysis.TokenFilter
- Throws:
IOException
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.