org.apache.lucene.analysis.compound
Class HyphenationCompoundWordTokenFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
                  extended by org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
All Implemented Interfaces:
Closeable

public class HyphenationCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

For example, "Donaudampfschiff" is decomposed into Donau, dampf, schiff, so that a search for "schiff" also finds "Donaudampfschiff". The filter uses a hyphenation grammar and a word dictionary to achieve this.
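
The decomposition described above can be illustrated with a minimal plain-Java sketch. It is not the filter's actual implementation: the class CompoundSketch, the method decompose, and the hard-coded break points are hypothetical and stand in for the positions a hyphenation grammar would report. The sketch assumes a lowercase dictionary and emits every stretch between two break points that appears in it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundSketch {

    // Returns every dictionary word that spans two candidate break points.
    // breakPoints are hypothetical character offsets in ascending order,
    // standing in for what a hyphenation grammar would produce.
    public static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> subwords = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String part = word.substring(breakPoints[i], breakPoints[j]);
                // Assumption: the dictionary holds lowercase entries.
                if (dictionary.contains(part.toLowerCase(Locale.ROOT))) {
                    subwords.add(part);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        int[] breakPoints = {0, 5, 10, 16}; // Donau|dampf|schiff
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
    }
}
```

Restricting candidates to break points keeps the number of dictionary lookups small compared with testing every substring of the word.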

You must specify the required Version compatibility when creating a HyphenationCompoundWordTokenFilter. See CompoundWordTokenFilterBase for the version-dependent behavior.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, matchVersion, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)
          Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
          Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
          Create a HyphenationCompoundWordTokenFilter with no dictionary.
 
Method Summary
protected  void decompose()
          Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
static HyphenationTree getHyphenationTree(File hyphenationFile)
          Creates a hyphenation tree from the given XML grammar file.
static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
          Creates a hyphenation tree from the XML grammar that the given InputSource points to.
static HyphenationTree getHyphenationTree(String hyphenationFilename)
          Creates a hyphenation tree from the XML grammar file with the given filename.
 
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          CharArraySet dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          CharArraySet dictionary,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize,
                                          boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
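
How the size limits and onlyLongestMatch interact can be sketched in plain Java. This is a hedged illustration, not the filter's code: the class SubwordLimits and the method apply are hypothetical, the limits are treated as inclusive (an assumption, since the descriptions above do not pin down the boundary case), and onlyLongestMatch is read literally as keeping a single longest candidate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SubwordLimits {

    // Applies the size limits to candidate subwords that already matched.
    // Assumption: all three limits are inclusive.
    public static List<String> apply(String word, List<String> candidates, int minWordSize,
                                     int minSubwordSize, int maxSubwordSize,
                                     boolean onlyLongestMatch) {
        // Words below minWordSize are not decomposed at all.
        if (word.length() < minWordSize) {
            return Collections.emptyList();
        }
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            int length = candidate.length();
            if (length >= minSubwordSize && length <= maxSubwordSize) {
                kept.add(candidate);
            }
        }
        if (!onlyLongestMatch || kept.isEmpty()) {
            return kept;
        }
        // Keep only the longest survivor; the first one wins on ties.
        String longest = kept.get(0);
        for (String candidate : kept) {
            if (candidate.length() > longest.length()) {
                longest = candidate;
            }
        }
        return Collections.singletonList(longest);
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Donau", "dampf", "schiff");
        // Limit values below are chosen for illustration only.
        System.out.println(apply("Donaudampfschiff", candidates, 5, 2, 15, false)); // prints [Donau, dampf, schiff]
        System.out.println(apply("Donaudampfschiff", candidates, 5, 6, 15, false)); // prints [schiff]
        System.out.println(apply("Donaudampfschiff", candidates, 5, 2, 15, true));  // prints [schiff]
    }
}
```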

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.

Calls HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator, null, minWordSize, minSubwordSize, maxSubwordSize, false).


HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.

Calls HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator, DEFAULT_MIN_WORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE).

Method Detail

getHyphenationTree

public static HyphenationTree getHyphenationTree(String hyphenationFilename)
                                          throws IOException
Creates a hyphenation tree from the XML grammar file with the given filename.

Parameters:
hyphenationFilename - the filename of the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
IOException - If there is a low-level I/O error.

getHyphenationTree

public static HyphenationTree getHyphenationTree(File hyphenationFile)
                                          throws IOException
Creates a hyphenation tree from the given XML grammar file.

Parameters:
hyphenationFile - the file of the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
IOException - If there is a low-level I/O error.

getHyphenationTree

public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
                                          throws IOException
Creates a hyphenation tree from the XML grammar that the given InputSource points to.

Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Returns:
An object representing the hyphenation patterns
Throws:
IOException - If there is a low-level I/O error.
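
The InputSource overload is useful when the grammar does not come from a plain file path. The following sketch shows two ways to build such a source with the JDK alone, assuming that InputSource here refers to org.xml.sax.InputSource. The class GrammarSource, its methods, and the file name are hypothetical helpers for illustration; the resulting object would then be passed to getHyphenationTree(InputSource).

```java
import java.io.File;
import java.io.StringReader;
import org.xml.sax.InputSource;

public class GrammarSource {

    // Wraps a grammar file as an InputSource identified by its file: URI.
    public static InputSource fromFile(File grammarFile) {
        return new InputSource(grammarFile.toURI().toASCIIString());
    }

    // Wraps grammar XML that is already held in memory.
    public static InputSource fromString(String grammarXml) {
        return new InputSource(new StringReader(grammarXml));
    }

    public static void main(String[] args) {
        // "hyph_example.xml" is a placeholder name, not a file shipped with the library.
        InputSource source = fromFile(new File("hyph_example.xml"));
        System.out.println(source.getSystemId());
        // The source would then be handed to the loader, for example:
        // HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
    }
}
```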

decompose

protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, because it is passed through this filter automatically.

Specified by:
decompose in class CompoundWordTokenFilterBase


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.