org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter

All Implemented Interfaces: Closeable, AutoCloseable, Unwrappable<TokenStream>

public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.

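The idea in the paragraph above can be illustrated without Lucene. The following is a minimal sketch, not the filter's actual implementation: it assumes the candidate break points have already been found (the real filter derives them from a hyphenation grammar), and the class name `CompoundSketch`, the method `decompose`, and the hand-picked break points are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CompoundSketch {
    /**
     * Returns every span between two break points that is a dictionary word.
     * breakPoints must be sorted and include 0 and word.length().
     * Assumption of this sketch: the dictionary holds lowercase entries.
     */
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = word.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hand-supplied break points: Donau|dampf|schiff
        int[] breakPoints = {0, 5, 10, 16};
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        System.out.println(decompose("Donaudampfschiff", breakPoints, dictionary));
        // prints [Donau, dampf, schiff]
    }
}
```

Because each part is emitted as its own term, a search for "schiff" can match a document that only contains "Donaudampfschiff".
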
Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State

Field Summary

Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens

Fields inherited from class org.apache.lucene.analysis.TokenFilter
input

Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY

Constructor Summary

Constructors

Constructor

Description

HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)

Create a HyphenationCompoundWordTokenFilter with no dictionary.

HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)

Create a HyphenationCompoundWordTokenFilter with no dictionary.

HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)

Creates a new HyphenationCompoundWordTokenFilter instance.

HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)

Creates a new HyphenationCompoundWordTokenFilter instance.

Method Summary

Modifier and Type

Method

Description

protected void

decompose()

Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.

static HyphenationTree

getHyphenationTree(String hyphenationFilename)

Creates a hyphenation tree from the XML grammar file with the given filename.

static HyphenationTree

getHyphenationTree(InputSource hyphenationSource)

Creates a hyphenation tree from an InputSource pointing to the XML grammar.

Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, reset

Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- HyphenationCompoundWordTokenFilter
  
  public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
  
  Creates a new HyphenationCompoundWordTokenFilter instance.
  
  Parameters:
  
  input - the TokenStream to process
  
  hyphenator - the hyphenation pattern tree to use for hyphenation
  
  dictionary - the word dictionary to match against.
- HyphenationCompoundWordTokenFilter
  
  public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
  
  Creates a new HyphenationCompoundWordTokenFilter instance.
  
  Parameters:
  
  input - the TokenStream to process
  
  hyphenator - the hyphenation pattern tree to use for hyphenation
  
  dictionary - the word dictionary to match against.
  
  minWordSize - only words longer than this get processed
  
  minSubwordSize - only subwords longer than this get to the output stream
  
  maxSubwordSize - only subwords shorter than this get to the output stream
  
  onlyLongestMatch - Add only the longest matching subword to the stream
- HyphenationCompoundWordTokenFilter
  
  public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
  
  Create a HyphenationCompoundWordTokenFilter with no dictionary.
  Delegates to the constructor that takes a dictionary, passing null as the dictionary and forwarding minWordSize, minSubwordSize, and maxSubwordSize.
- HyphenationCompoundWordTokenFilter
  
  public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)
  
  Create a HyphenationCompoundWordTokenFilter with no dictionary.
  Calls HyphenationCompoundWordTokenFilter(input, hyphenator, DEFAULT_MIN_WORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE).
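
To show how the constructor parameters interact, here is a plain-Java sketch under stated assumptions. It is not Lucene's code: the class `SubwordOptionsSketch` and the method `subwords` are hypothetical, break points are supplied by hand, and the size limits are treated as inclusive, which is a choice made for this sketch and should be checked against the Lucene source. A null dictionary stands in for the "no dictionary" constructors, where every span between break points is accepted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class SubwordOptionsSketch {
    /** dictionary may be null, meaning "accept every span between break points". */
    static List<String> subwords(String word, int[] breakPoints, Set<String> dictionary,
                                 int minWordSize, int minSubwordSize, int maxSubwordSize,
                                 boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        if (word.length() < minWordSize) {
            return out; // too short to be treated as a compound
        }
        for (int i = 0; i < breakPoints.length; i++) {
            String longest = null; // longest accepted span starting at breakPoints[i]
            for (int j = i + 1; j < breakPoints.length; j++) {
                int len = breakPoints[j] - breakPoints[i];
                if (len < minSubwordSize) continue; // too short, try a later end point
                if (len > maxSubwordSize) break;    // later end points only get longer
                String part = word.substring(breakPoints[i], breakPoints[j]);
                if (dictionary != null && !dictionary.contains(part)) continue;
                if (!onlyLongestMatch) {
                    out.add(part);
                } else if (longest == null || part.length() > longest.length()) {
                    longest = part;
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] breakPoints = {0, 5, 10, 16};
        // No dictionary, all matches.
        System.out.println(subwords("donaudampfschiff", breakPoints, null, 5, 2, 15, false));
        // No dictionary, only the longest match per start position.
        System.out.println(subwords("donaudampfschiff", breakPoints, null, 5, 2, 15, true));
    }
}
```

Without a dictionary, the output includes spans such as "donaudampf" that are not real words, which is why supplying a dictionary usually gives cleaner index terms.
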
Method Details
- getHyphenationTree
  
  public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws IOException
  
  Creates a hyphenation tree from the XML grammar file with the given filename.
  
  Parameters:
  
  hyphenationFilename - the filename of the XML grammar to load
  
  Returns:
  
  An object representing the hyphenation patterns
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- getHyphenationTree
  
  public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws IOException
  
  Creates a hyphenation tree from an InputSource pointing to the XML grammar.
  
  Parameters:
  
  hyphenationSource - the InputSource pointing to the XML grammar
  
  Returns:
  
  An object representing the hyphenation patterns
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- decompose
  
  protected void decompose()
  
  Description copied from class: CompoundWordTokenFilterBase
  
  Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, because it is automatically passed through this filter.
  
  Specified by:
  
  decompose in class CompoundWordTokenFilterBase
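
For the getHyphenationTree(InputSource) overload described under Method Details, the caller has to build an org.xml.sax.InputSource first. The sketch below shows two common ways to do that using only JDK classes. The class `HyphenationSourceSketch`, its method names, and the file name `hyphenation-grammar.xml` are hypothetical; the call into Lucene is left as a comment because it needs Lucene on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Path;
import org.xml.sax.InputSource;

class HyphenationSourceSketch {
    /** Builds an InputSource identified only by a file URI; the parser opens the file. */
    static InputSource fromPath(Path grammarFile) {
        return new InputSource(grammarFile.toUri().toString());
    }

    /** Builds an InputSource over an already open stream; the caller closes the stream. */
    static InputSource fromStream(InputStream in, String systemId) {
        InputSource source = new InputSource(in);
        source.setSystemId(systemId); // identifies the document to the parser
        return source;
    }

    public static void main(String[] args) {
        InputSource source = fromPath(Path.of("hyphenation-grammar.xml"));
        System.out.println(source.getSystemId());
        // With Lucene available, the source would then be passed on:
        // HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(source);
    }
}
```

The stream variant is useful when the grammar is loaded from a classpath resource or another non-file location.
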
