HyphenationCompoundWordTokenFilter (Lucene 3.6.2 API)

java.lang.Object
- org.apache.lucene.util.AttributeSource
  - org.apache.lucene.analysis.TokenStream
    - org.apache.lucene.analysis.TokenFilter
      - org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
        - org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter

All Implemented Interfaces:

Closeable
```
public class HyphenationCompoundWordTokenFilter
extends CompoundWordTokenFilterBase
```
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff, so that a search for "schiff" alone still finds "Donaudampfschiff". It uses a hyphenation grammar and a word dictionary to achieve this.
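
The idea can be illustrated with a small, self-contained sketch. It is not Lucene's implementation: the class `DecomposeSketch`, its `decompose` method, and the hard-coded break points (standing in for the positions a hyphenation grammar would supply) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
final class DecomposeSketch {

    // Returns every dictionary word that spans two candidate break points.
    // breakPoints must be sorted and include 0 and word.length().
    static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = word.substring(breakPoints[i], breakPoints[j]);
                if (dictionary.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<String>(Arrays.asList("donau", "dampf", "schiff"));
        // Break points assumed for "donau|dampf|schiff".
        int[] breakPoints = {0, 5, 10, 16};
        // prints [donau, dampf, schiff]
        System.out.println(decompose("donaudampfschiff", breakPoints, dictionary));
    }
}
```

Indexing the sub-words alongside the original token is what lets a query for "schiff" match the compound.
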
You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:
- As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have LowerCaseFilter before this filter in your analysis chain. For optimal performance (this filter does many dictionary lookups), you should use the latter setup: a lowercased CharArraySet with LowerCaseFilter earlier in the chain. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive sets!
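
The two dictionary setups described above can be compared with plain JDK collections. This is only an analogy, assuming a `TreeSet` with `String.CASE_INSENSITIVE_ORDER` as a stand-in for a case-insensitive CharArraySet and a `HashSet` for a case-sensitive one; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; JDK sets stand in for CharArraySet.
final class DictionaryCaseSketch {

    // Setup 1: a case-insensitive dictionary; every lookup ignores case.
    static Set<String> caseInsensitive(String... words) {
        Set<String> set = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(words));
        return set;
    }

    // Setup 2: a case-sensitive dictionary holding only lowercased entries.
    // Lookups succeed only if tokens were lowercased earlier in the chain.
    static Set<String> lowercasedOnly(String... words) {
        Set<String> set = new HashSet<String>();
        for (String word : words) {
            set.add(word.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitive("Schiff").contains("SCHIFF")); // prints true
        System.out.println(lowercasedOnly("Schiff").contains("Schiff"));  // prints false
        System.out.println(lowercasedOnly("Schiff").contains("schiff"));  // prints true
    }
}
```
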

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
  CompoundWordTokenFilterBase.CompoundToken
- Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
  AttributeSource.AttributeFactory, AttributeSource.State

Field Summary
- Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
  DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
- Fields inherited from class org.apache.lucene.analysis.TokenFilter
  input

Constructor Summary

Constructors
Constructor and Description
`HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)` Deprecated. Use `HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set)` instead.
`HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)` Deprecated. Use `HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean)` instead.
`HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary)` Deprecated. Use `HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[])` instead.
`HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)` Deprecated. Use `HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean)` instead.
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator)` Create a HyphenationCompoundWordTokenFilter with no dictionary.
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)` Create a HyphenationCompoundWordTokenFilter with no dictionary.
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary)` Creates a new `HyphenationCompoundWordTokenFilter` instance.
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, Set<?> dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)` Creates a new `HyphenationCompoundWordTokenFilter` instance.
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary)` Deprecated. Use the constructors taking `Set`
`HyphenationCompoundWordTokenFilter(Version matchVersion, TokenStream input, HyphenationTree hyphenator, String[] dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)` Deprecated. Use the constructors taking `Set`

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`decompose()` Decomposes the current `CompoundWordTokenFilterBase.termAtt` and places `CompoundWordTokenFilterBase.CompoundToken` instances in the `CompoundWordTokenFilterBase.tokens` list.
`static HyphenationTree`	`getHyphenationTree(File hyphenationFile)` Create a hyphenator tree
`static HyphenationTree`	`getHyphenationTree(InputSource hyphenationSource)` Create a hyphenator tree
`static HyphenationTree`	`getHyphenationTree(Reader hyphenationReader)` Deprecated. Don't use Readers with fixed charset to load XML files, unless programmatically created. Use `getHyphenationTree(InputSource)` instead, where you can supply default charset and input stream, if you like.
`static HyphenationTree`	`getHyphenationTree(String hyphenationFilename)` Create a hyphenator tree

Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, makeDictionary, reset

Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end

Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                             TokenStream input,
                                             HyphenationTree hyphenator,
                                             String[] dictionary,
                                             int minWordSize,
                                             int minSubwordSize,
                                             int maxSubwordSize,
                                             boolean onlyLongestMatch)
```
    Deprecated. Use the constructors taking Set
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against
    minWordSize - only words longer than this get processed
    minSubwordSize - only subwords longer than this get to the output stream
    maxSubwordSize - only subwords shorter than this get to the output stream
    onlyLongestMatch - Add only the longest matching subword to the stream
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                             TokenStream input,
                                             HyphenationTree hyphenator,
                                             String[] dictionary)
```
    Deprecated. Use the constructors taking Set
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against
  - HyphenationCompoundWordTokenFilter
```
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                  TokenStream input,
                                  HyphenationTree hyphenator,
                                  Set<?> dictionary)
```
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against.
  - HyphenationCompoundWordTokenFilter
```
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                  TokenStream input,
                                  HyphenationTree hyphenator,
                                  Set<?> dictionary,
                                  int minWordSize,
                                  int minSubwordSize,
                                  int maxSubwordSize,
                                  boolean onlyLongestMatch)
```
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against.
    minWordSize - only words longer than this get processed
    minSubwordSize - only subwords longer than this get to the output stream
    maxSubwordSize - only subwords shorter than this get to the output stream
    onlyLongestMatch - Add only the longest matching subword to the stream
  - HyphenationCompoundWordTokenFilter
```
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                  TokenStream input,
                                  HyphenationTree hyphenator,
                                  int minWordSize,
                                  int minSubwordSize,
                                  int maxSubwordSize)
```
    Create a HyphenationCompoundWordTokenFilter with no dictionary.
    Calls HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator, null, minWordSize, minSubwordSize, maxSubwordSize, false).
  - HyphenationCompoundWordTokenFilter
```
public HyphenationCompoundWordTokenFilter(Version matchVersion,
                                  TokenStream input,
                                  HyphenationTree hyphenator)
```
    Create a HyphenationCompoundWordTokenFilter with no dictionary.
    Calls HyphenationCompoundWordTokenFilter(matchVersion, input, hyphenator, DEFAULT_MIN_WORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE).
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator,
                                             String[] dictionary,
                                             int minWordSize,
                                             int minSubwordSize,
                                             int maxSubwordSize,
                                             boolean onlyLongestMatch)
```
    Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[], int, int, int, boolean) instead.
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against
    minWordSize - only words longer than this get processed
    minSubwordSize - only subwords longer than this get to the output stream
    maxSubwordSize - only subwords shorter than this get to the output stream
    onlyLongestMatch - Add only the longest matching subword to the stream
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator,
                                             String[] dictionary)
```
    Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, String[]) instead.
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator,
                                             Set<?> dictionary)
```
    Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set) instead.
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
  - HyphenationCompoundWordTokenFilter
```
@Deprecated
public HyphenationCompoundWordTokenFilter(TokenStream input,
                                             HyphenationTree hyphenator,
                                             Set<?> dictionary,
                                             int minWordSize,
                                             int minSubwordSize,
                                             int maxSubwordSize,
                                             boolean onlyLongestMatch)
```
    Deprecated. Use HyphenationCompoundWordTokenFilter(Version, TokenStream, HyphenationTree, Set, int, int, int, boolean) instead.
    
    Creates a new HyphenationCompoundWordTokenFilter instance.
    
    Parameters:
    input - the TokenStream to process
    hyphenator - the hyphenation pattern tree to use for hyphenation
    dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
    minWordSize - only words longer than this get processed
    minSubwordSize - only subwords longer than this get to the output stream
    maxSubwordSize - only subwords shorter than this get to the output stream
    onlyLongestMatch - Add only the longest matching subword to the stream
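
    The effect of the minSubwordSize, maxSubwordSize and onlyLongestMatch parameters described above can be sketched with a simple filter over candidate sub-words. This is a rough illustration, not Lucene's code: the class `SubwordLimitsSketch` is hypothetical, and treating both bounds as inclusive and applying onlyLongestMatch across the whole candidate list are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class SubwordLimitsSketch {

    // Keeps candidates whose length lies within [minSubwordSize, maxSubwordSize]
    // (inclusive bounds are an assumption). With onlyLongestMatch, only the
    // longest surviving candidate is returned.
    static List<String> select(List<String> candidates, int minSubwordSize,
                               int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> kept = new ArrayList<String>();
        String longest = null;
        for (String candidate : candidates) {
            int length = candidate.length();
            if (length < minSubwordSize || length > maxSubwordSize) {
                continue; // outside the configured size window
            }
            kept.add(candidate);
            if (longest == null || length > longest.length()) {
                longest = candidate;
            }
        }
        if (onlyLongestMatch) {
            kept.clear();
            if (longest != null) {
                kept.add(longest);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("rind", "fleisch", "rindfleisch", "ei");
        System.out.println(select(candidates, 3, 8, false)); // prints [rind, fleisch]
        System.out.println(select(candidates, 3, 8, true));  // prints [fleisch]
    }
}
```
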
- Method Detail
  - getHyphenationTree
```
public static HyphenationTree getHyphenationTree(String hyphenationFilename)
                                          throws Exception
```
    Create a hyphenator tree
    
    Parameters:
    hyphenationFilename - the filename of the XML grammar to load
    
    Returns:
    An object representing the hyphenation patterns
    
    Throws:
    
    Exception
  - getHyphenationTree
```
public static HyphenationTree getHyphenationTree(File hyphenationFile)
                                          throws Exception
```
    Create a hyphenator tree
    
    Parameters:
    hyphenationFile - the file of the XML grammar to load
    
    Returns:
    An object representing the hyphenation patterns
    
    Throws:
    
    Exception
  - getHyphenationTree
```
@Deprecated
public static HyphenationTree getHyphenationTree(Reader hyphenationReader)
                                          throws Exception
```
    Deprecated. Don't use Readers with fixed charset to load XML files, unless programmatically created. Use getHyphenationTree(InputSource) instead, where you can supply default charset and input stream, if you like.
    
    Create a hyphenator tree
    
    Parameters:
    hyphenationReader - the reader of the XML grammar to load from
    
    Returns:
    An object representing the hyphenation patterns
    
    Throws:
    
    Exception
  - getHyphenationTree
```
public static HyphenationTree getHyphenationTree(InputSource hyphenationSource)
                                          throws Exception
```
    Create a hyphenator tree
    
    Parameters:
    hyphenationSource - the InputSource pointing to the XML grammar
    
    Returns:
    An object representing the hyphenation patterns
    
    Throws:
    
    Exception
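
    A minimal sketch of preparing an `InputSource` for this method is shown here. `org.xml.sax.InputSource` and its setters are standard JDK API; the helper class `HyphenationSourceSketch`, the file name `de.xml` and the placeholder XML bytes are assumptions for illustration. When no encoding is set, SAX parsers normally detect it from the byte stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.InputSource;

// Hypothetical helper; only InputSource itself is real JDK API here.
final class HyphenationSourceSketch {

    // Wraps a byte stream in an InputSource. The system id helps the parser
    // resolve relative references; the encoding is set only when known.
    static InputSource toInputSource(InputStream in, String systemId, String encoding) {
        InputSource source = new InputSource(in);
        source.setSystemId(systemId);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<placeholder/>".getBytes("UTF-8"); // stand-in for a real grammar file
        InputSource source = toInputSource(new ByteArrayInputStream(xml), "de.xml", "UTF-8");
        // With Lucene on the classpath, `source` would then be passed to
        // HyphenationCompoundWordTokenFilter.getHyphenationTree(source).
        System.out.println(source.getSystemId() + " " + source.getEncoding()); // prints de.xml UTF-8
    }
}
```
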
  - decompose
```
protected void decompose()
```
    Description copied from class: CompoundWordTokenFilterBase
    
    Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, as it is automatically passed through this filter.
    
    Specified by:
    
    decompose in class CompoundWordTokenFilterBase
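
    The contract of decompose can be pictured as a template method: the base class emits the original token itself and then whatever the subclass collected. The sketch below is a simplified analogy built on plain strings; `PassThroughSketch` and `HalvingSketch` are hypothetical classes, and the fixed midpoint split merely stands in for grammar-driven decomposition.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration of the contract; not Lucene's actual base class.
abstract class PassThroughSketch {
    protected final LinkedList<String> tokens = new LinkedList<String>();

    // Subclasses add sub-words to `tokens`; they do not add `term` itself.
    protected abstract void decompose(String term);

    // Emits the original term first, then whatever decompose() collected.
    final List<String> process(String term) {
        tokens.clear();
        decompose(term);
        List<String> out = new ArrayList<String>();
        out.add(term);
        out.addAll(tokens);
        return out;
    }
}

// Toy subclass: splits at an assumed fixed midpoint instead of using a grammar.
final class HalvingSketch extends PassThroughSketch {
    @Override
    protected void decompose(String term) {
        int mid = term.length() / 2;
        if (mid > 0) {
            tokens.add(term.substring(0, mid));
            tokens.add(term.substring(mid));
        }
    }
}
```

    Under these assumptions, `new HalvingSketch().process("abcd")` yields the original term followed by its two halves: `[abcd, ab, cd]`.
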
