org.apache.lucene.analysis.compound
Class HyphenationCompoundWordTokenFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
                  extended by org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter

public class HyphenationCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" is split into "Donau", "dampf" and "schiff", so that a search for "schiff" alone also finds "Donaudampfschiff". The filter uses a hyphenation grammar to find candidate split points and a word dictionary to decide which of the resulting parts are emitted as subwords.

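The decomposition described above can be illustrated with a minimal, stdlib-only sketch. It is not the Lucene implementation: the class DecompoundSketch, its decompose method and the hard-coded break points are assumptions made for this example. In the real filter the break points would come from the HyphenationTree; here they are passed in as an array, and every span between two break points that appears in the dictionary is emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not the Lucene implementation.
class DecompoundSketch {

    // breakPoints: sorted candidate split offsets, including 0 and word.length().
    // In the real filter these would be derived from the hyphenation grammar.
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> subwords = new ArrayList<String>();
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                // Candidate subword between two break points.
                String part = lower.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(part)) {
                    subwords.add(part);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<String>(Arrays.asList("donau", "dampf", "schiff"));
        // Assumed hyphenation points for "Donaudampfschiff": Do-nau-dampf-schiff.
        int[] breakPoints = {0, 2, 5, 10, 16};
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
    }
}
```

Spans such as "do" or "nau" are also candidates, but they are dropped because the dictionary does not contain them.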

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, onlyLongestMatch, tokens
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, Set dictionary)
           
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, Set dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
           
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, String[] dictionary)
           
HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
           
 
Method Summary
protected  void decomposeInternal(org.apache.lucene.analysis.Token token)
           
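static HyphenationTree getHyphenationTree(File hyphenationFile)
          Creates a hyphenation tree from the XML grammar in the given file.
static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
          Creates a hyphenation tree from the XML grammar referenced by the given InputSource.
static HyphenationTree getHyphenationTree(Reader hyphenationReader)
          Creates a hyphenation tree from the XML grammar read from the given reader.
static HyphenationTree getHyphenationTree(String hyphenationFilename)
          Creates a hyphenation tree from the XML grammar in the named file.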
 
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
addAllLowerCase, createToken, decompose, incrementToken, makeDictionary, makeLowerCaseCopy, next, next, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, setOnlyUseNewAPI
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                          HyphenationTree hyphenator,
                                          String[] dictionary,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize,
                                          boolean onlyLongestMatch)
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this are processed
minSubwordSize - only subwords longer than this are added to the output stream
maxSubwordSize - only subwords shorter than this are added to the output stream
onlyLongestMatch - if true, only the longest matching subword is added to the stream
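
The effect of the four tuning parameters can be sketched without Lucene. The class SubwordLimitsSketch and its decompose method below are hypothetical and only model the behaviour described above. Two points are assumptions of the sketch, not statements about the library: the size limits are treated as inclusive, and onlyLongestMatch keeps the longest match per start offset. Check the Lucene source if either detail matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the size limits; not the Lucene implementation.
class SubwordLimitsSketch {

    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<String>();
        // Words below the minimum size are not decomposed at all.
        if (word.length() < minWordSize) {
            return subwords;
        }
        for (int i = 0; i < breakPoints.length; i++) {
            String longest = null; // longest match starting at breakPoints[i]
            for (int j = i + 1; j < breakPoints.length; j++) {
                String part = word.substring(breakPoints[i], breakPoints[j]);
                // Limits are treated as inclusive here (an assumption).
                if (part.length() < minSubwordSize || part.length() > maxSubwordSize) {
                    continue;
                }
                if (!dictionary.contains(part)) {
                    continue;
                }
                if (!onlyLongestMatch) {
                    subwords.add(part);
                } else if (longest == null || part.length() > longest.length()) {
                    longest = part;
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<String>(
                Arrays.asList("donau", "dampf", "schiff", "dampfschiff"));
        int[] breakPoints = {0, 5, 10, 16}; // assumed split: donau-dampf-schiff
        String word = "donaudampfschiff";
        // prints [donau, dampf, dampfschiff, schiff]
        System.out.println(decompose(word, breakPoints, dictionary, 5, 2, 15, false));
        // prints [donau, dampfschiff, schiff]
        System.out.println(decompose(word, breakPoints, dictionary, 5, 2, 15, true));
        // prints [donau, dampf, schiff]
        System.out.println(decompose(word, breakPoints, dictionary, 5, 2, 8, false));
    }
}
```

Lowering maxSubwordSize to 8 drops "dampfschiff" (11 characters), while onlyLongestMatch=true drops "dampf" in favour of the longer match that starts at the same offset.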

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                          HyphenationTree hyphenator,
                                          String[] dictionary)
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                          HyphenationTree hyphenator,
                                          Set dictionary)
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
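
Because a Set dictionary is expected to contain only lowercase strings, a word list may need to be normalised before it is passed in. The following sketch shows one way to do that with plain java.util collections. The class LowerCaseDictionarySketch and the method toLowerCaseSet are hypothetical helpers, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of the Lucene API.
class LowerCaseDictionarySketch {

    // Returns a new set holding the lowercase form of every input word.
    static Set<String> toLowerCaseSet(String[] words) {
        Set<String> result = new HashSet<String>();
        for (String word : words) {
            // Locale.ROOT avoids locale-specific case mappings.
            result.add(word.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = toLowerCaseSet(new String[] {"Donau", "Dampf", "Schiff"});
        System.out.println(dictionary.contains("schiff")); // prints true
        System.out.println(dictionary.contains("Schiff")); // prints false
    }
}
```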

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                          HyphenationTree hyphenator,
                                          Set dictionary,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize,
                                          boolean onlyLongestMatch)
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this are processed
minSubwordSize - only subwords longer than this are added to the output stream
maxSubwordSize - only subwords shorter than this are added to the output stream
onlyLongestMatch - if true, only the longest matching subword is added to the stream

Method Detail

getHyphenationTree

public static HyphenationTree getHyphenationTree(String hyphenationFilename)
                                          throws Exception
Creates a hyphenation tree from the XML grammar in the named file.

Parameters:
hyphenationFilename - the filename of the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

public static HyphenationTree getHyphenationTree(File hyphenationFile)
                                          throws Exception
Creates a hyphenation tree from the XML grammar in the given file.

Parameters:
hyphenationFile - the file containing the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

public static HyphenationTree getHyphenationTree(Reader hyphenationReader)
                                          throws Exception
Creates a hyphenation tree from the XML grammar read from the given reader.

Parameters:
hyphenationReader - the reader to load the XML grammar from
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
                                          throws Exception
Creates a hyphenation tree from the XML grammar referenced by the given InputSource.

Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Returns:
An object representing the hyphenation patterns
Throws:
Exception
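
The four getHyphenationTree overloads accept a filename, a File, a Reader or an InputSource. The sketch below shows how the first three kinds of input can each be wrapped in an org.xml.sax.InputSource using only JDK classes. The class HyphenationSourceSketch, its methods and the file name de.xml are hypothetical, and the sketch does not claim to mirror how Lucene converts between the overloads internally.

```java
import java.io.File;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.InputSource;

// Hypothetical helper; shows one way to obtain an InputSource for a grammar.
class HyphenationSourceSketch {

    static InputSource fromFilename(String filename) {
        return fromFile(new File(filename));
    }

    static InputSource fromFile(File file) {
        // The system id is the file's URI, so a parser can locate the document.
        return new InputSource(file.toURI().toASCIIString());
    }

    static InputSource fromReader(Reader reader) {
        // A Reader carries no location, so the system id stays null.
        return new InputSource(reader);
    }

    public static void main(String[] args) {
        // "de.xml" is a placeholder name for a hyphenation grammar file.
        System.out.println(fromFilename("de.xml").getSystemId());
        // prints null
        System.out.println(fromReader(new StringReader("<x/>")).getSystemId());
    }
}
```

An InputSource built this way could then be passed to getHyphenationTree(InputSource hyphenationSource).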

decomposeInternal

protected void decomposeInternal(org.apache.lucene.analysis.Token token)
Specified by:
decomposeInternal in class CompoundWordTokenFilterBase


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.