WordDelimiterFilter (Lucene 4.7.2 API)

org.apache.lucene.analysis.miscellaneous
Class WordDelimiterFilter

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.TokenFilter
              org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter

All Implemented Interfaces: Closeable

public final class WordDelimiterFilter
extends TokenFilter

Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" → "Wi", "Fi"
split on case transitions: "PowerShot" → "Power", "Shot"
split on letter-number transitions: "SD500" → "SD", "500"
leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.
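The splitting rules above can be illustrated with a small plain-Java sketch. It is not Lucene's implementation: the class name `SubwordSplitSketch` and its helpers are hypothetical, every non-alphanumeric character is assumed to be a delimiter, and only lower-to-upper case transitions are treated as split points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the splitting rules; not the Lucene code.
final class SubwordSplitSketch {

    // Character classes used by this sketch only.
    private enum Kind { LOWER, UPPER, DIGIT, DELIM }

    private static Kind kindOf(char c) {
        if (Character.isDigit(c)) return Kind.DIGIT;
        if (Character.isUpperCase(c)) return Kind.UPPER;
        if (Character.isLetter(c)) return Kind.LOWER;
        return Kind.DELIM; // assumption: anything non-alphanumeric delimits
    }

    // Adds the pending subword to the result unless it is empty.
    private static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    /** Splits one token into subwords following the rules listed above. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Kind prev = Kind.DELIM;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            Kind kind = kindOf(c);
            // Trailing "'s": apostrophe + s followed by a delimiter or the end.
            boolean possessive = c == '\'' && current.length() > 0
                    && i + 1 < token.length()
                    && Character.toLowerCase(token.charAt(i + 1)) == 's'
                    && (i + 2 == token.length()
                        || kindOf(token.charAt(i + 2)) == Kind.DELIM);
            if (possessive) {
                i++; // skip the 's' so it never becomes a subword
            }
            if (kind == Kind.DELIM) {
                flush(parts, current); // delimiters end the current subword
            } else {
                boolean caseChange = prev == Kind.LOWER && kind == Kind.UPPER;
                boolean letterDigitChange = prev != Kind.DELIM
                        && (prev == Kind.DIGIT) != (kind == Kind.DIGIT);
                if (caseChange || letterDigitChange) {
                    flush(parts, current);
                }
                current.append(c);
            }
            prev = kind;
        }
        flush(parts, current);
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot")); // prints [Power, Shot]
        System.out.println(split("O'Neil's"));  // prints [O, Neil]
    }
}
```

The sketch folds possessive removal into the same pass as splitting, which mirrors the note above that this step is not done in a separate filter.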

The combinations parameter affects how subwords are combined:

combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
- "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
- "A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
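The position bookkeeping for combinations="1" can be sketched in plain Java as follows. This is an illustration under assumptions, not the filter's code: the class `CatenateSketch` is hypothetical, it takes subwords that have already been split, and it skips the catenated token when a run contains only one subword (a choice made here to avoid emitting a duplicate).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of catenating runs of non-numeric subwords.
final class CatenateSketch {

    private static boolean isNumeric(String s) {
        if (s.isEmpty()) return false;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) return false;
        }
        return true;
    }

    /** Returns "position:token" entries for already-split subwords. */
    static List<String> withCatenation(List<String> subwords) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < subwords.size(); pos++) {
            String part = subwords.get(pos);
            out.add(pos + ":" + part); // every subword keeps its own position
            if (isNumeric(part)) {
                continue; // numeric subwords never join a run
            }
            run.append(part);
            runLength++;
            boolean runEnds = pos == subwords.size() - 1
                    || isNumeric(subwords.get(pos + 1));
            if (runEnds) {
                if (runLength > 1) {
                    // Catenated token shares the position of the run's last subword.
                    out.add(pos + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = new ArrayList<>();
        parts.add("Power");
        parts.add("Shot");
        System.out.println(withCatenation(parts)); // prints [0:Power, 1:Shot, 1:PowerShot]
    }
}
```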

One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi" and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying. Because the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary
`static int`	`ALPHA`
`static int`	`ALPHANUM`
`static int`	`CATENATE_ALL` Causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000"
`static int`	`CATENATE_NUMBERS` Causes maximum runs of number parts to be catenated: "500-42" => "50042"
`static int`	`CATENATE_WORDS` Causes maximum runs of word parts to be catenated: "wi-fi" => "wifi"
`static int`	`DIGIT`
`static int`	`GENERATE_NUMBER_PARTS` Causes number subwords to be generated: "500-42" => "500" "42"
`static int`	`GENERATE_WORD_PARTS` Causes parts of words to be generated: "PowerShot" => "Power" "Shot"
`static int`	`LOWER`
`static int`	`PRESERVE_ORIGINAL` Causes original words to be preserved and added to the subword list (defaults to false): "500-42" => "500" "42" "500-42"
`static int`	`SPLIT_ON_CASE_CHANGE` If not set, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens)
`static int`	`SPLIT_ON_NUMERICS` If not set, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
`static int`	`STEM_ENGLISH_POSSESSIVE` Causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
`static int`	`SUBWORD_DELIM`
`static int`	`UPPER`

Fields inherited from class org.apache.lucene.analysis.TokenFilter
`input`

Constructor Summary
`WordDelimiterFilter(TokenStream in, byte[] charTypeTable, int configurationFlags, CharArraySet protWords)` Creates a new WordDelimiterFilter
`WordDelimiterFilter(TokenStream in, int configurationFlags, CharArraySet protWords)` Creates a new WordDelimiterFilter using `WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE` as its charTypeTable
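Because `configurationFlags` is a single `int`, the constants from the Field Summary are combined with bitwise OR. The sketch below shows the pattern without depending on Lucene: `FlagsSketch` and its constants are hypothetical stand-ins with placeholder values chosen for this example (the real values are on the Constant Field Values page). With Lucene on the classpath, the same expressions would presumably use `WordDelimiterFilter.GENERATE_WORD_PARTS` and so on, with the result passed as `configurationFlags`.

```java
// Hypothetical stand-in for the flag constants; values are placeholders.
final class FlagsSketch {

    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 1 << 1;
    static final int CATENATE_WORDS = 1 << 2;
    static final int SPLIT_ON_CASE_CHANGE = 1 << 6;

    /** True when the given flag bit is set in flags. */
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    /** Index-time configuration: subwords plus catenated word runs. */
    static int indexFlags() {
        return GENERATE_WORD_PARTS
                | GENERATE_NUMBER_PARTS
                | SPLIT_ON_CASE_CHANGE
                | CATENATE_WORDS;
    }

    /** Query-time configuration: the same, without catenation. */
    static int queryFlags() {
        return indexFlags() & ~CATENATE_WORDS;
    }

    public static void main(String[] args) {
        System.out.println(has(indexFlags(), CATENATE_WORDS)); // prints true
        System.out.println(has(queryFlags(), CATENATE_WORDS)); // prints false
    }
}
```

Catenating at index time only follows the usage suggested in the class description: a document containing "wi-fi" is indexed with "wifi" as well, so a "wifi" query can match without any catenation on the query side.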

Method Summary
`boolean`	`incrementToken()`
`void`	`reset()`

Methods inherited from class org.apache.lucene.analysis.TokenFilter
`close, end`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, wait, wait, wait`

Field Detail

LOWER

public static final int LOWER

See Also: Constant Field Values

UPPER

public static final int UPPER

See Also: Constant Field Values

DIGIT

public static final int DIGIT

See Also: Constant Field Values

SUBWORD_DELIM

public static final int SUBWORD_DELIM

See Also: Constant Field Values

ALPHA

public static final int ALPHA

See Also: Constant Field Values

ALPHANUM

public static final int ALPHANUM

See Also: Constant Field Values

GENERATE_WORD_PARTS

public static final int GENERATE_WORD_PARTS

Causes parts of words to be generated:

"PowerShot" => "Power" "Shot"

See Also: Constant Field Values

GENERATE_NUMBER_PARTS

public static final int GENERATE_NUMBER_PARTS

Causes number subwords to be generated:

"500-42" => "500" "42"

See Also: Constant Field Values

CATENATE_WORDS

public static final int CATENATE_WORDS

Causes maximum runs of word parts to be catenated:

"wi-fi" => "wifi"

See Also: Constant Field Values

CATENATE_NUMBERS

public static final int CATENATE_NUMBERS

Causes maximum runs of number parts to be catenated:

"500-42" => "50042"

See Also: Constant Field Values

CATENATE_ALL

public static final int CATENATE_ALL

Causes all subword parts to be catenated:

"wi-fi-4000" => "wifi4000"

See Also: Constant Field Values

PRESERVE_ORIGINAL

public static final int PRESERVE_ORIGINAL

Causes original words to be preserved and added to the subword list (defaults to false):

"500-42" => "500" "42" "500-42"

See Also: Constant Field Values

SPLIT_ON_CASE_CHANGE

public static final int SPLIT_ON_CASE_CHANGE

If not set, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens)

See Also: Constant Field Values

SPLIT_ON_NUMERICS

public static final int SPLIT_ON_NUMERICS

If not set, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).

See Also: Constant Field Values

STEM_ENGLISH_POSSESSIVE

public static final int STEM_ENGLISH_POSSESSIVE

Causes trailing "'s" to be removed for each subword:

"O'Neil's" => "O", "Neil"

See Also: Constant Field Values

Constructor Detail