java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
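The idea can be illustrated with a small plain-Java sketch. It is not Lucene's implementation: the `breakPoints` array is a hypothetical stand-in for the positions a hyphenation grammar would report, and `DecompositionSketch` is a name made up for this example. Every substring between two break points is kept if the dictionary contains it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class DecompositionSketch {

    /**
     * Returns every dictionary entry that spans two candidate break points.
     * breakPoints must be sorted ascending; it stands in for hyphenation points.
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> parts = new ArrayList<String>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = lower.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<String>(Arrays.asList("donau", "dampf", "schiff"));
        // Assumed break points: Donau|dampf|schiff
        int[] breakPoints = {0, 5, 10, 16};
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
    }
}
```

In an index, the subwords would be emitted alongside the original token, which is why a search for "schiff" can match the compound.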
You must specify the required Version compatibility when creating a CompoundWordTokenFilterBase (see the matchVersion constructor parameter).
If you pass in a CharArraySet as the dictionary, it should be case-insensitive unless it contains only lowercased entries and you have a LowerCaseFilter before this filter in your analysis chain.
For optimal performance, prefer the latter setup (a LowerCaseFilter in the chain plus a lowercase-only CharArraySet), because this filter performs many dictionary lookups.
Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive sets.
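The two dictionary setups described above can be sketched with plain `java.util` collections. This is only an illustration of the trade-off, not Lucene code: `DictionaryCaseSketch` and its methods are hypothetical names, and a `TreeSet` or `HashSet` stands in for a CharArraySet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class DictionaryCaseSketch {

    /** Option 1: the dictionary itself ignores case on every lookup. */
    static Set<String> caseInsensitive(String... words) {
        Set<String> set = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(words));
        return set;
    }

    /** Option 2: lowercase-only entries; callers must lowercase tokens before lookup. */
    static Set<String> lowercased(String... words) {
        Set<String> set = new HashSet<String>();
        for (String word : words) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> insensitive = caseInsensitive("Schiff");
        Set<String> lowerOnly = lowercased("Schiff");
        System.out.println(insensitive.contains("SCHIFF")); // matches regardless of case
        System.out.println(lowerOnly.contains("Schiff"));   // misses: token was not lowercased
        System.out.println(lowerOnly.contains("schiff"));   // matches once the token is lowercased
    }
}
```

Option 2 mirrors the recommended chain: lowercasing happens once per token upstream, so each lookup is an exact match.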
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
|---|
CompoundWordTokenFilterBase.CompoundToken |
| Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
|---|
AttributeSource.AttributeFactory, AttributeSource.State |
| Field Summary |
|---|
| Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
|---|
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens |
| Fields inherited from class org.apache.lucene.analysis.TokenFilter |
|---|
input |
| Constructor Summary | |
|---|---|
HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary)
Deprecated. use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set) instead. |
|
HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean) instead. |
|
HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
Deprecated. use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[]) instead. |
|
HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean) instead. |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary. |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
int minWordSize,
int minSubwordSize,
int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary. |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance. |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance. |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
Deprecated. Use the constructors taking Set |
|
HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set |
|
| Method Summary | |
|---|---|
protected void |
decompose()
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. |
static HyphenationTree |
getHyphenationTree(File hyphenationFile)
Create a hyphenator tree |
static HyphenationTree |
getHyphenationTree(InputSource hyphenationSource)
Create a hyphenator tree |
static HyphenationTree |
getHyphenationTree(Reader hyphenationReader)
Deprecated. Don't use Readers with a fixed charset to load XML files, unless they are programmatically created. Use getHyphenationTree(InputSource) instead, where you can supply a default charset and input stream, if you like. |
static HyphenationTree |
getHyphenationTree(String hyphenationFilename)
Create a hyphenator tree |
| Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase |
|---|
incrementToken, makeDictionary, reset |
| Methods inherited from class org.apache.lucene.analysis.TokenFilter |
|---|
close, end |
| Methods inherited from class org.apache.lucene.util.AttributeSource |
|---|
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
@Deprecated
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. Use the constructors taking Set.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
Deprecated. Use the constructors taking Set.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
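How the size parameters and onlyLongestMatch interact can be sketched in plain Java. This is a rough illustration under stated assumptions, not the filter's actual logic: it reads the parameter descriptions literally (strict comparisons), picks the longest match across the whole candidate list, and `SubwordFilterSketch` is a hypothetical name. The library's boundary handling and longest-match scope may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SubwordFilterSketch {

    /** Applies the size limits to dictionary matches found in a word. */
    static List<String> filter(String word, List<String> candidates,
                               int minWordSize, int minSubwordSize,
                               int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> out = new ArrayList<String>();
        // Assumption: only words strictly longer than minWordSize are processed.
        if (word.length() <= minWordSize) {
            return out;
        }
        String longest = null;
        for (String candidate : candidates) {
            int length = candidate.length();
            // Assumption: both subword bounds are exclusive.
            if (length <= minSubwordSize || length >= maxSubwordSize) {
                continue;
            }
            if (onlyLongestMatch) {
                if (longest == null || length > longest.length()) {
                    longest = candidate;
                }
            } else {
                out.add(candidate);
            }
        }
        if (longest != null) {
            out.add(longest);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> matches = Arrays.asList("donau", "dampf", "schiff", "au");
        System.out.println(filter("donaudampfschiff", matches, 5, 2, 15, false));
        System.out.println(filter("donaudampfschiff", matches, 5, 2, 15, true));
    }
}
```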
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator,
int minWordSize,
int minSubwordSize,
int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
public HyphenationCompoundWordTokenFilter(Version matchVersion,
TokenStream input,
HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
String[] dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean) instead.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
String[] dictionary)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[]) instead.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set) instead.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
HyphenationTree hyphenator,
Set<?> dictionary,
int minWordSize,
int minSubwordSize,
int maxSubwordSize,
boolean onlyLongestMatch)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean) instead.
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - add only the longest matching subword to the stream
| Method Detail |
|---|
public static HyphenationTree getHyphenationTree(String hyphenationFilename)
throws Exception
Create a hyphenator tree.
Parameters:
hyphenationFilename - the filename of the XML grammar to load
Throws:
Exception
public static HyphenationTree getHyphenationTree(File hyphenationFile)
throws Exception
Create a hyphenator tree.
Parameters:
hyphenationFile - the file of the XML grammar to load
Throws:
Exception
@Deprecated
public static HyphenationTree getHyphenationTree(Reader hyphenationReader)
throws Exception
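Deprecated. Don't use Readers with a fixed charset to load XML files, unless they are programmatically created. Use getHyphenationTree(InputSource) instead, where you can supply a default charset and input stream, if you like.
Create a hyphenator tree.
Parameters:
hyphenationReader - the reader of the XML grammar to load from
Throws:
Exception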
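The reason for preferring an InputSource over a Reader can be shown with the JDK's own XML parser. The sketch below does not call Lucene; `InputSourceSketch` and `rootText` are hypothetical names. It wraps raw bytes in an `org.xml.sax.InputSource`, so the parser reads the charset from the XML declaration instead of relying on whatever charset a Reader was built with. If a file has no declaration, `InputSource.setEncoding(String)` can supply one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InputSourceSketch {

    /** Parses XML bytes through a byte-stream InputSource and returns the root element's text. */
    static String rootText(byte[] xmlBytes) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(source);
        return document.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The document declares ISO-8859-1 and is encoded that way on disk.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>stra\u00dfe</p>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(bytes));
    }
}
```

The same InputSource could then be handed to getHyphenationTree(InputSource) when the bytes are a hyphenation grammar.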
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
throws Exception
Create a hyphenator tree.
Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Throws:
Exception
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
The original token may not be placed in the list, as it is automatically passed through this filter.
Specified by:
decompose in class CompoundWordTokenFilterBase