org.apache.lucene.analysis.compound
Class HyphenationCompoundWordTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
- All Implemented Interfaces:
- Closeable
public class HyphenationCompoundWordTokenFilter
- extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words, which are common in many
Germanic languages. For example, "Donaudampfschiff" is split into Donau,
dampf and schiff, so that a search for "schiff" alone still finds
"Donaudampfschiff". The filter uses a hyphenation grammar and a word
dictionary to find the subwords.
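The dictionary side of the decomposition can be illustrated without Lucene. The following is a minimal sketch, not the filter's actual algorithm: it tries every substring of the word against the dictionary, whereas the real filter only considers substrings that lie between hyphenation points. The class name CompoundSketch and the method subwords are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: brute-force dictionary decompounding, not Lucene's algorithm. */
class CompoundSketch {

    /**
     * Returns every dictionary word that occurs as a substring of the input,
     * ordered by start offset. The dictionary is assumed to hold lower case
     * strings only, so the input is lower-cased before matching.
     */
    static List<String> subwords(String word, Set<String> lowerCaseDictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (lowerCaseDictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        // prints [donau, dampf, schiff]
        System.out.println(subwords("Donaudampfschiff", dictionary));
    }
}
```

Restricting candidates to hyphenation points, as the filter does, avoids both the quadratic scan and accidental matches that cut across syllables.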
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Constructor Summary
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
Set dictionary)
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
Set dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
- Parameters:
input
- the TokenStream to process
hyphenator
- the hyphenation pattern tree to use for hyphenation
dictionary
- the word dictionary to match against
minWordSize
- only words longer than this get processed
minSubwordSize
- only subwords longer than this get to the output stream
maxSubwordSize
- only subwords shorter than this get to the output stream
onlyLongestMatch
- add only the longest matching subword to the stream
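How the four tuning parameters interact can be sketched in plain Java. This is an illustration under stated assumptions, not the filter's code: the comparisons are strict because the parameter descriptions say "longer than" and "shorter than" (whether the real bounds are inclusive should be checked against the implementation), and onlyLongestMatch is interpreted here as keeping the single longest candidate for the whole word. SubwordLimitsSketch and select are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: one reading of how the size limits prune candidate subwords. */
class SubwordLimitsSketch {

    static List<String> select(String word, List<String> candidates,
                               int minWordSize, int minSubwordSize,
                               int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> kept = new ArrayList<>();
        // Assumption: strict bounds, taken literally from the parameter descriptions.
        if (word.length() <= minWordSize) {
            return kept; // word is too short to be decomposed at all
        }
        String longest = null;
        for (String c : candidates) {
            if (c.length() <= minSubwordSize || c.length() >= maxSubwordSize) {
                continue; // subword outside the allowed size window
            }
            kept.add(c);
            if (longest == null || c.length() > longest.length()) {
                longest = c;
            }
        }
        if (onlyLongestMatch && longest != null) {
            return List.of(longest); // assumption: one longest match per word
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("donau", "dampf", "schiff");
        System.out.println(select("Donaudampfschiff", parts, 5, 2, 15, false));
        System.out.println(select("Donaudampfschiff", parts, 5, 2, 15, true));
    }
}
```

With the values shown, the first call keeps all three subwords and the second keeps only "schiff", the longest of them.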
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
- Parameters:
input
- the TokenStream to process
hyphenator
- the hyphenation pattern tree to use for hyphenation
dictionary
- the word dictionary to match against
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
Set dictionary)
- Parameters:
input
- the TokenStream to process
hyphenator
- the hyphenation pattern tree to use for hyphenation
dictionary
- the word dictionary to match against. If this is a CharArraySet,
it must have been created with ignoreCase=false and must contain only
lower case strings.
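Because the dictionary is expected to contain only lower case strings, it can help to normalize a word list before handing it to the constructor. The following is a minimal sketch using only java.util; DictionarySketch and toLowerCaseSet are hypothetical names, and the use of Locale.ROOT for lower-casing is an assumption rather than something this class prescribes.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: prepares a lower case, duplicate-free word set. */
class DictionarySketch {

    static Set<String> toLowerCaseSet(List<String> words) {
        Set<String> result = new LinkedHashSet<>(); // keeps insertion order
        for (String w : words) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: locale-independent lower-casing is what is wanted.
                result.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [donau, dampf, schiff]
        System.out.println(toLowerCaseSet(List.of("Donau", " Dampf ", "SCHIFF", "schiff", "")));
    }
}
```

The resulting Set can then be passed as the dictionary argument of the Set-based constructors.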
HyphenationCompoundWordTokenFilter
public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
HyphenationTree hyphenator,
Set dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
- Parameters:
input
- the TokenStream to process
hyphenator
- the hyphenation pattern tree to use for hyphenation
dictionary
- the word dictionary to match against. If this is a CharArraySet,
it must have been created with ignoreCase=false and must contain only
lower case strings.
minWordSize
- only words longer than this get processed
minSubwordSize
- only subwords longer than this get to the output stream
maxSubwordSize
- only subwords shorter than this get to the output stream
onlyLongestMatch
- add only the longest matching subword to the stream
getHyphenationTree
public static HyphenationTree getHyphenationTree(String hyphenationFilename)
throws Exception
- Create a hyphenator tree
- Parameters:
hyphenationFilename
- the filename of the XML grammar to load
- Returns:
- An object representing the hyphenation patterns
- Throws:
Exception
getHyphenationTree
public static HyphenationTree getHyphenationTree(File hyphenationFile)
throws Exception
- Create a hyphenator tree
- Parameters:
hyphenationFile
- the file of the XML grammar to load
- Returns:
- An object representing the hyphenation patterns
- Throws:
Exception
getHyphenationTree
public static HyphenationTree getHyphenationTree(Reader hyphenationReader)
throws Exception
- Create a hyphenator tree
- Parameters:
hyphenationReader
- the reader of the XML grammar to load from
- Returns:
- An object representing the hyphenation patterns
- Throws:
Exception
getHyphenationTree
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
throws Exception
- Create a hyphenator tree
- Parameters:
hyphenationSource
- the InputSource pointing to the XML grammar
- Returns:
- An object representing the hyphenation patterns
- Throws:
Exception
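For the InputSource overload, the source can be built with the JDK alone. The sketch below is one possible way to do it and makes two assumptions: the grammar is opened as a byte stream so that the XML parser can honor the encoding declared in the file, and a system ID is set so that relative references inside the grammar (for example a DTD) can be resolved. GrammarSourceSketch and openGrammar are hypothetical names; the call into getHyphenationTree appears only as a comment because it needs Lucene on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.InputSource;

/** Illustrative only: builds an InputSource for an XML hyphenation grammar. */
class GrammarSourceSketch {

    /** The caller is responsible for closing source.getByteStream(). */
    static InputSource openGrammar(Path grammarFile) throws IOException {
        InputStream in = Files.newInputStream(grammarFile);
        InputSource source = new InputSource(in);
        // Lets the parser resolve relative references in the grammar file.
        source.setSystemId(grammarFile.toUri().toString());
        return source;
    }

    public static void main(String[] args) throws Exception {
        Path grammar = Files.createTempFile("hyphenation", ".xml");
        InputSource source = openGrammar(grammar);
        try {
            // With Lucene available, the source would be passed on like this:
            // HyphenationTree tree =
            //     HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
            System.out.println(source.getSystemId());
        } finally {
            source.getByteStream().close();
            Files.delete(grammar);
        }
    }
}
```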
decomposeInternal
protected void decomposeInternal(org.apache.lucene.analysis.Token token)
- Specified by:
decomposeInternal
in class CompoundWordTokenFilterBase
Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.