java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.TokenFilter
        org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
          org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
public class HyphenationCompoundWordTokenFilter
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
You must specify the required Version compatibility when creating a HyphenationCompoundWordTokenFilter; see CompoundWordTokenFilterBase for details.
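The decomposition described above can be illustrated with a minimal, Lucene-free sketch. It assumes the hyphenation grammar has already produced a set of candidate break points (hard-coded here, purely hypothetical) and then keeps every fragment between two break points that appears in the word dictionary. The class name `CompoundSketch` and its `decompose` method are invented for illustration and are not part of the Lucene API; the real filter works on token attributes and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CompoundSketch {

    /**
     * Returns every dictionary word that spans two candidate break points.
     * breakPoints are ascending character offsets into word and are assumed
     * to include 0 and word.length(). The dictionary is assumed lowercase.
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = word.substring(breakPoints[i], breakPoints[j]);
                if (candidate.length() == word.length()) {
                    continue; // the original word is not reported as its own part
                }
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical break points for Do-nau-dampf-schiff; a real grammar supplies these.
        int[] breaks = {0, 2, 5, 10, 16};
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        System.out.println(decompose("Donaudampfschiff", breaks, dictionary));
        // prints [Donau, dampf, schiff]
    }
}
```

Fragments such as "Do" or "nau" fall between valid break points but are discarded because the dictionary does not contain them, which is why both a grammar and a dictionary are used together.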
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
CompoundWordTokenFilterBase.CompoundToken |
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
---|
AttributeSource.AttributeFactory, AttributeSource.State |
Field Summary |
---|
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens |
Fields inherited from class org.apache.lucene.analysis.TokenFilter |
---|
input |
Constructor Summary | |
---|---|
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator) | Create a HyphenationCompoundWordTokenFilter with no dictionary. |
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary) | Creates a new HyphenationCompoundWordTokenFilter instance. |
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) | Creates a new HyphenationCompoundWordTokenFilter instance. |
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize) | Create a HyphenationCompoundWordTokenFilter with no dictionary. |
Method Summary | |
---|---|
protected void | decompose() Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. |
static HyphenationTree | getHyphenationTree(File hyphenationFile) Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(InputSource hyphenationSource) Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(String hyphenationFilename) Create a hyphenator tree |
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
---|
incrementToken, reset |
Methods inherited from class org.apache.lucene.analysis.TokenFilter |
---|
close, end |
Methods inherited from class org.apache.lucene.util.AttributeSource |
---|
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
public HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
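To make the roles of minWordSize, minSubwordSize, maxSubwordSize and onlyLongestMatch concrete, here is a simplified, Lucene-free model of how such limits might gate candidate subwords. Everything in it is an assumption for illustration: the class `SubwordLimitsSketch`, the `applyLimits` method, the inclusive treatment of the size bounds, and the choice to keep a single longest candidate overall. The actual filter may apply these rules differently, so treat this as a sketch of the idea rather than a description of the implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubwordLimitsSketch {

    /**
     * Filters candidate subwords of word by the size limits.
     * Assumption: all bounds are treated as inclusive in this sketch.
     */
    static List<String> applyLimits(String word, List<String> candidates,
                                    int minWordSize, int minSubwordSize,
                                    int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> kept = new ArrayList<>();
        if (word.length() < minWordSize) {
            return kept; // word too short: it is not decomposed at all
        }
        for (String candidate : candidates) {
            int len = candidate.length();
            if (len >= minSubwordSize && len <= maxSubwordSize) {
                kept.add(candidate);
            }
        }
        if (onlyLongestMatch && kept.size() > 1) {
            // Keep only the first candidate of maximal length.
            String longest = kept.get(0);
            for (String candidate : kept) {
                if (candidate.length() > longest.length()) {
                    longest = candidate;
                }
            }
            kept.clear();
            kept.add(longest);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Do", "Donau", "dampf", "schiff", "dampfschiff");
        System.out.println(applyLimits("Donaudampfschiff", candidates, 5, 3, 6, false));
        System.out.println(applyLimits("Donaudampfschiff", candidates, 5, 3, 6, true));
    }
}
```

With the limits used in `main`, "Do" is rejected as too short and "dampfschiff" as too long; enabling onlyLongestMatch then narrows the remaining candidates to one.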
Method Detail |
---|
public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws IOException
Create a hyphenator tree
Parameters:
hyphenationFilename - the filename of the XML grammar to load
Throws:
IOException - If there is a low-level I/O error.
public static HyphenationTree getHyphenationTree(File hyphenationFile) throws IOException
Create a hyphenator tree
Parameters:
hyphenationFile - the file of the XML grammar to load
Throws:
IOException - If there is a low-level I/O error.
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws IOException
Create a hyphenator tree
Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Throws:
IOException - If there is a low-level I/O error.
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, as it is automatically passed through this filter.
Specified by:
decompose in class CompoundWordTokenFilterBase
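For the getHyphenationTree(InputSource) overload, the caller has to build an org.xml.sax.InputSource first. The sketch below shows two common ways to do that with the JDK alone: from a file on disk and from XML already held in memory. The class `HyphenationSourceSketch`, the file name `de_hyph.xml` and the system id `memory:grammar` are hypothetical. The call into Lucene appears only as a comment, since it needs Lucene on the classpath and a real grammar file.

```java
import java.io.File;
import java.io.StringReader;
import org.xml.sax.InputSource;

final class HyphenationSourceSketch {

    /** Wraps a grammar file as an InputSource identified by its file: URI. */
    static InputSource fromFile(File grammarFile) {
        return new InputSource(grammarFile.toURI().toASCIIString());
    }

    /** Wraps in-memory XML; the system id only labels the source. */
    static InputSource fromString(String xml, String systemId) {
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        // Hypothetical grammar file name; substitute the path to a real XML grammar.
        InputSource source = fromFile(new File("de_hyph.xml"));
        System.out.println(source.getSystemId());
        // With Lucene available, the source would then be passed on:
        // HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
    }
}
```

The String and File overloads cover the simple cases; the InputSource form is mainly useful when the grammar comes from somewhere other than the local file system.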