org.apache.lucene.analysis.compound
Class HyphenationCompoundWordTokenFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
                  extended by org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter
All Implemented Interfaces:
Closeable

public class HyphenationCompoundWordTokenFilter
extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.

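As a rough illustration of the idea, and not of the filter's actual algorithm, the following Lucene-free sketch finds dictionary words inside a compound by brute force. The real filter takes its candidate split points from the hyphenation grammar; the class CompoundSketch, its decompose method, and the size limits 2 and 15 are hypothetical names and values chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class CompoundSketch {

    // Returns every dictionary word found inside 'word', ordered by start offset.
    // Candidate lengths are limited to [minSubwordSize, maxSubwordSize]; treating
    // both bounds as inclusive is an assumption of this sketch.
    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary, 2, 15));
    }
}
```

Indexing the parts alongside the original token is what lets a query for "schiff" match a document that only contains "Donaudampfschiff".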

Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, onlyLongestMatch, tokens
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
          Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set) instead.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean) instead.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary)
          Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[]) instead.
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean) instead.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)
          Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
          Create a HyphenationCompoundWordTokenFilter with no dictionary.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)
          Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary)
          Creates a new HyphenationCompoundWordTokenFilter instance.
HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
          Creates a new HyphenationCompoundWordTokenFilter instance.
 
Method Summary
protected  void decomposeInternal(Token token)
           
static HyphenationTree getHyphenationTree(File hyphenationFile)
          Create a hyphenator tree
static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
          Create a hyphenator tree
static HyphenationTree getHyphenationTree(Reader hyphenationReader)
          Deprecated. Don't use Readers with a fixed charset to load XML files unless they are created programmatically. Use getHyphenationTree(InputSource) instead, where you can supply a default charset and an input stream.
static HyphenationTree getHyphenationTree(String hyphenationFilename)
          Create a hyphenator tree
 
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
addAllLowerCase, createToken, decompose, incrementToken, makeDictionary, makeDictionary, makeLowerCaseCopy, reset
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          String[] dictionary,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize,
                                          boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream
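
The effect of onlyLongestMatch can be sketched without Lucene. The example below assumes the flag is applied per starting position, which is one reading of the parameter description rather than a statement about the library's internals. LongestMatchSketch and subwords are hypothetical names, and candidate boundaries come from brute force instead of a hyphenation tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class LongestMatchSketch {

    // Collects dictionary words found in 'word'. When onlyLongestMatch is true,
    // at most one match (the longest) is kept for each start offset.
    static List<String> subwords(String word, Set<String> dictionary, boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            String longest = null;
            for (int end = start + 1; end <= word.length(); end++) {
                String candidate = word.substring(start, end);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // candidates grow with 'end', so the last match is the longest
                } else {
                    out.add(candidate);
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("dampf", "dampfschiff", "schiff"));
        System.out.println(subwords("dampfschiff", dictionary, false)); // prints [dampf, dampfschiff, schiff]
        System.out.println(subwords("dampfschiff", dictionary, true));  // prints [dampfschiff, schiff]
    }
}
```

Under this reading, enabling the flag trades recall on shorter parts (here "dampf") for fewer tokens in the output stream.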

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          String[] dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          Set<?> dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
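
Because the dictionary is expected to hold only lowercase entries, a word list can be normalized before it is handed to the constructor. The sketch below uses only the JDK; DictionarySketch and lowerCaseDictionary are hypothetical names, and the choice of Locale.ROOT for lowercasing is an assumption that may not suit every language.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Lucene.
class DictionarySketch {

    // Returns a case-sensitive set holding the lowercase form of every word.
    static Set<String> lowerCaseDictionary(String... words) {
        Set<String> dictionary = new HashSet<>();
        for (String word : words) {
            dictionary.add(word.toLowerCase(Locale.ROOT));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Set<String> dictionary = lowerCaseDictionary("Donau", "Dampf", "Schiff");
        System.out.println(dictionary.contains("schiff")); // prints true
        System.out.println(dictionary.contains("Schiff")); // prints false
    }
}
```

The resulting Set<String> matches the Set<?> parameter type of this constructor.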

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          Set<?> dictionary,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize,
                                          boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator,
                                          int minWordSize,
                                          int minSubwordSize,
                                          int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.

Delegates to the full constructor, passing null as the dictionary together with the given minWordSize, minSubwordSize, and maxSubwordSize.


HyphenationCompoundWordTokenFilter

public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                          TokenStream input,
                                          HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.

Calls HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator, DEFAULT_MIN_WORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE).


HyphenationCompoundWordTokenFilter

@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                                     HyphenationTree hyphenator,
                                                     String[] dictionary,
                                                     int minWordSize,
                                                     int minSubwordSize,
                                                     int maxSubwordSize,
                                                     boolean onlyLongestMatch)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean) instead.

Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

HyphenationCompoundWordTokenFilter

@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                                     HyphenationTree hyphenator,
                                                     String[] dictionary)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[]) instead.

Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against

HyphenationCompoundWordTokenFilter

@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                                     HyphenationTree hyphenator,
                                                     Set<?> dictionary)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set) instead.

Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.

HyphenationCompoundWordTokenFilter

@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                                     HyphenationTree hyphenator,
                                                     Set<?> dictionary,
                                                     int minWordSize,
                                                     int minSubwordSize,
                                                     int maxSubwordSize,
                                                     boolean onlyLongestMatch)
Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean) instead.

Creates a new HyphenationCompoundWordTokenFilter instance.

Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

Method Detail

getHyphenationTree

public static HyphenationTree getHyphenationTree(String hyphenationFilename)
                                          throws Exception
Create a hyphenator tree

Parameters:
hyphenationFilename - the filename of the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

public static HyphenationTree getHyphenationTree(File hyphenationFile)
                                          throws Exception
Create a hyphenator tree

Parameters:
hyphenationFile - the file of the XML grammar to load
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

@Deprecated
public static HyphenationTree getHyphenationTree(Reader hyphenationReader)
                                          throws Exception
Deprecated. Don't use Readers with a fixed charset to load XML files unless they are created programmatically. Use getHyphenationTree(InputSource) instead, where you can supply a default charset and an input stream.

Create a hyphenator tree

Parameters:
hyphenationReader - the reader of the XML grammar to load from
Returns:
An object representing the hyphenation patterns
Throws:
Exception

getHyphenationTree

public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
                                          throws Exception
Create a hyphenator tree

Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Returns:
An object representing the hyphenation patterns
Throws:
Exception
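
One way to prepare the argument for this overload is to wrap a byte stream in an org.xml.sax.InputSource, so that no Reader with a fixed charset is involved. The sketch below uses only JDK classes; HyphenationSourceSketch, toInputSource, and the systemId value shown in the test are hypothetical, and the call into Lucene appears only as a comment.

```java
import java.io.InputStream;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of Lucene.
class HyphenationSourceSketch {

    // Wraps raw bytes in an InputSource. If defaultEncoding is null, no encoding
    // is set and the XML parser is left to detect it from the byte stream.
    static InputSource toInputSource(InputStream bytes, String systemId, String defaultEncoding) {
        InputSource source = new InputSource(bytes);
        source.setSystemId(systemId); // used by parsers to resolve relative references and report errors
        if (defaultEncoding != null) {
            source.setEncoding(defaultEncoding);
        }
        return source;
        // With Lucene on the classpath, the result would then be passed on:
        // HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
    }
}
```

Passing a byte stream rather than a character stream leaves the InputSource's character stream unset, which is what allows the parser to apply the supplied or detected encoding.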

decomposeInternal

protected void decomposeInternal(Token token)
Specified by:
decomposeInternal in class CompoundWordTokenFilterBase


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.