Class HyphenationCompoundWordTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
-
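The decomposition described in the class overview can be illustrated with a minimal standalone sketch. It does not use Lucene and is not the filter's actual implementation: the class DecompositionSketch, its decompose method, and the hand-picked break points (standing in for the output of a hyphenation grammar) are all assumptions made for illustration. The idea shown is only that candidate subwords are cut between break points and kept when the dictionary contains them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class DecompositionSketch {

    /**
     * Returns every dictionary word that spans two candidate break points.
     * breakPoints must be sorted ascending and include 0 and word.length().
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                // Candidate subword between break point i and break point j.
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Assumed break points; a real hyphenation grammar would supply these.
        int[] breakPoints = {0, 2, 5, 10, 16};
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
    }
}
```

With these inputs the sketch prints [donau, dampf, schiff], so a search for "schiff" could match a document that only contains the compound.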
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructor / Description
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)
    Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
    Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
    Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
    Creates a new HyphenationCompoundWordTokenFilter instance.
-
Method Summary
Modifier and Type / Method / Description
protected void decompose()
    Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
static HyphenationTree getHyphenationTree(String hyphenationFilename)
    Create a hyphenator tree
static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
    Create a hyphenator tree
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
- Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
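To illustrate how the size limits and onlyLongestMatch might interact, here is a minimal standalone sketch. It is not Lucene's implementation. The class SubwordFilterSketch, the hand-picked break points, and the inclusive treatment of the three size bounds are assumptions for illustration, and "longest match" is read here as the longest dictionary match starting at each break point. Check the exact boundary behavior against the Lucene source before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class SubwordFilterSketch {

    /** breakPoints must be strictly ascending and include 0 and word.length(). */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary,
            int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> parts = new ArrayList<>();
        if (word.length() < minWordSize) {
            return parts; // word too short: no decomposition attempted (assumed inclusive bound)
        }
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i < breakPoints.length; i++) {
            String longest = null;
            for (int j = i + 1; j < breakPoints.length; j++) {
                int length = breakPoints[j] - breakPoints[i];
                if (length > maxSubwordSize) {
                    break; // any later candidate from this start is longer still
                }
                if (length < minSubwordSize) {
                    continue; // too short, try the next break point
                }
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // a later j always yields a longer candidate
                } else {
                    parts.add(candidate);
                }
            }
            if (longest != null) {
                parts.add(longest);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        int[] breakPoints = {0, 5, 11}; // assumed break points for "Dampfschiff"
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("dampf", "schiff", "dampfschiff"));
        System.out.println(decompose("Dampfschiff", breakPoints, dictionary, 5, 2, 15, false));
        System.out.println(decompose("Dampfschiff", breakPoints, dictionary, 5, 2, 15, true));
    }
}
```

In this sketch the first call yields [dampf, dampfschiff, schiff] and the second yields [dampfschiff, schiff]. Lowering maxSubwordSize to 6 drops the 11-character match, and raising minWordSize to 12 skips the word entirely.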
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
-
-
Method Details
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws IOException
Create a hyphenator tree
- Parameters:
hyphenationFilename - the filename of the XML grammar to load
- Returns:
An object representing the hyphenation patterns
- Throws:
IOException - If there is a low-level I/O error.
-
getHyphenationTree
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws IOException
Create a hyphenator tree
- Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
- Returns:
An object representing the hyphenation patterns
- Throws:
IOException - If there is a low-level I/O error.
-
decompose
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token may not be placed in the list, as it is automatically passed through this filter.
- Specified by:
decompose in class CompoundWordTokenFilterBase
-