java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter
public final class WordDelimiterFilter
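Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
- split on intra-word delimiters: "Wi-Fi" → "Wi", "Fi"
- split on case transitions: "PowerShot" → "Power", "Shot"
- split on letter-number transitions: "SD500" → "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
- trailing "'s" is removed from each subword: "O'Neil's" → "O", "Neil"

The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" additionally catenates maximum runs of non-numeric subwords and emits the result at the position of the last subword in the run:
  - "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
  - "A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
  - "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"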
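The splitting behaviour shown in the examples above can be sketched in plain Java. This is a simplified, standalone illustration and not the filter's actual implementation: the class name `SubwordSplitSketch` and its helpers are hypothetical, every non-alphanumeric character is assumed to be a delimiter, and token positions and configuration flags are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch of the subword rules; not Lucene code.
final class SubwordSplitSketch {

    private enum Kind { LOWER, UPPER, DIGIT, DELIM }

    private static Kind kindOf(char c) {
        if (Character.isDigit(c)) return Kind.DIGIT;
        if (Character.isUpperCase(c)) return Kind.UPPER;
        if (Character.isLetter(c)) return Kind.LOWER;
        return Kind.DELIM; // assumption: anything non-alphanumeric delimits
    }

    // True when a new subword starts between two adjacent non-delimiter characters.
    private static boolean isBoundary(Kind prev, Kind cur) {
        if (prev == cur) return false;
        // An upper-case letter followed by lower case stays together ("Power").
        if (prev == Kind.UPPER && cur == Kind.LOWER) return false;
        return true; // lower-to-upper case change, or letter/digit transition
    }

    private static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Kind prev = Kind.DELIM;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            Kind kind = kindOf(c);
            // Possessive rule: drop a trailing "'s" that ends a subword.
            boolean possessive = c == '\'' && current.length() > 0
                    && i + 1 < token.length()
                    && Character.toLowerCase(token.charAt(i + 1)) == 's'
                    && (i + 2 == token.length() || kindOf(token.charAt(i + 2)) == Kind.DELIM);
            if (possessive) {
                i++; // skip the 's'; the apostrophe is handled as a delimiter below
            }
            if (kind == Kind.DELIM) {
                flush(parts, current); // leading and trailing delimiters produce nothing
            } else {
                if (current.length() > 0 && isBoundary(prev, kind)) {
                    flush(parts, current);
                }
                current.append(c);
            }
            prev = kind;
        }
        flush(parts, current);
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("Wi-Fi"));     // [Wi, Fi]
        System.out.println(split("PowerShot")); // [Power, Shot]
        System.out.println(split("SD500"));     // [SD, 500]
        System.out.println(split("O'Neil's"));  // [O, Neil]
    }
}
```

The real filter additionally assigns token positions and honours the configuration flags, which this sketch leaves out.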
One use for WordDelimiterFilter is to help match words with different
subword delimiters. For example, if the source text contained "wi-fi" one may
want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
is to specify combinations="1" in the analyzer used for indexing, and
combinations="0" (the default) in the analyzer used for querying. Given that
the current StandardTokenizer immediately removes many intra-word
delimiters, it is recommended that this filter be used after a tokenizer that
does not do this (such as WhitespaceTokenizer).
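The index-time versus query-time asymmetry described above can be sketched without any Lucene dependency. The class `CatenateMatchSketch` and its methods are hypothetical; the sketch assumes terms are lower-cased (in a real analyzer a separate lower-casing filter would do that), uses a crude regular-expression splitter, and ignores token positions, so it only approximates how a real query would match.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: catenate at index time, split only at query time.
final class CatenateMatchSketch {

    // Crude stand-in for subword splitting: break on non-alphanumerics and on
    // lower-to-upper case changes, then lower-case each part.
    static List<String> subwords(String text) {
        List<String> parts = new ArrayList<>();
        for (String p : text.split("[^A-Za-z0-9]+|(?<=[a-z])(?=[A-Z])")) {
            if (!p.isEmpty()) {
                parts.add(p.toLowerCase(Locale.ROOT));
            }
        }
        return parts;
    }

    // Index side, in the spirit of combinations="1": subwords plus their catenation.
    static Set<String> indexTerms(String text) {
        List<String> parts = subwords(text);
        Set<String> terms = new HashSet<>(parts);
        if (parts.size() > 1) {
            terms.add(String.join("", parts));
        }
        return terms;
    }

    // Query side, in the spirit of combinations="0": the query matches when
    // every one of its subwords is an indexed term.
    static boolean matches(String indexedText, String query) {
        return indexTerms(indexedText).containsAll(subwords(query));
    }

    public static void main(String[] args) {
        for (String q : new String[] {"wifi", "WiFi", "wi-fi", "wi+fi", "wireless"}) {
            System.out.println(q + " -> " + matches("wi-fi", q));
        }
    }
}
```

Because the catenated form "wifi" is stored only on the index side, the query analyzer can stay simple and still match all four spellings.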
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
|---|
AttributeSource.AttributeFactory, AttributeSource.State |
| Field Summary | | |
|---|---|---|
| static int | ALPHA | |
| static int | ALPHANUM | |
| static int | CATENATE_ALL | Causes all subword parts to be catenated: "wi-fi-4000" => "wifi4000" |
| static int | CATENATE_NUMBERS | Causes maximum runs of number parts to be catenated: "500-42" => "50042" |
| static int | CATENATE_WORDS | Causes maximum runs of word parts to be catenated: "wi-fi" => "wifi" |
| static int | DIGIT | |
| static int | GENERATE_NUMBER_PARTS | Causes number subwords to be generated: "500-42" => "500" "42" |
| static int | GENERATE_WORD_PARTS | Causes parts of words to be generated: "PowerShot" => "Power" "Shot" |
| static int | LOWER | |
| static int | PRESERVE_ORIGINAL | Causes original words to be preserved and added to the subword list (defaults to false): "500-42" => "500" "42" "500-42" |
| static int | SPLIT_ON_CASE_CHANGE | If not set, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). |
| static int | SPLIT_ON_NUMERICS | If not set, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). |
| static int | STEM_ENGLISH_POSSESSIVE | Causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil" |
| static int | SUBWORD_DELIM | |
| static int | UPPER | |
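Since the options above are int constants passed together as a single configurationFlags argument, they are naturally combined with bitwise OR. The sketch below shows the idiom only. The class `FlagMaskSketch` is hypothetical and the numeric values are stand-ins chosen for illustration; real code should reference the WordDelimiterFilter constants rather than hard-coded numbers.

```java
// Hypothetical sketch of composing and testing a flags bitmask.
final class FlagMaskSketch {
    // Stand-in bit values for illustration only (assumed, not read from the library).
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 1 << 1;
    static final int CATENATE_WORDS = 1 << 2;
    static final int SPLIT_ON_CASE_CHANGE = 1 << 6;

    // True when the given option bit is present in the mask.
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    // Index-time configuration: generate parts and also catenate word runs.
    static int indexFlags() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
                | CATENATE_WORDS | SPLIT_ON_CASE_CHANGE;
    }

    // Query-time configuration: same options with catenation switched off.
    static int queryFlags() {
        return indexFlags() & ~CATENATE_WORDS;
    }

    public static void main(String[] args) {
        System.out.println(has(indexFlags(), CATENATE_WORDS)); // true
        System.out.println(has(queryFlags(), CATENATE_WORDS)); // false
    }
}
```

Deriving the query mask from the index mask keeps the two analyzers in step when further options are added.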
| Fields inherited from class org.apache.lucene.analysis.TokenFilter |
|---|
input |
| Constructor Summary | |
|---|---|
| WordDelimiterFilter(TokenStream in, byte[] charTypeTable, int configurationFlags, CharArraySet protWords) | Creates a new WordDelimiterFilter |
| WordDelimiterFilter(TokenStream in, int configurationFlags, CharArraySet protWords) | Creates a new WordDelimiterFilter using WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE as its charTypeTable |
| Method Summary | |
|---|---|
| boolean | incrementToken() |
| void | reset() |
| Methods inherited from class org.apache.lucene.analysis.TokenFilter |
|---|
close, end |
| Methods inherited from class org.apache.lucene.util.AttributeSource |
|---|
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final int LOWER
public static final int UPPER
public static final int DIGIT
public static final int SUBWORD_DELIM
public static final int ALPHA
public static final int ALPHANUM
public static final int GENERATE_WORD_PARTS
public static final int GENERATE_NUMBER_PARTS
public static final int CATENATE_WORDS
public static final int CATENATE_NUMBERS
public static final int CATENATE_ALL
public static final int PRESERVE_ORIGINAL
public static final int SPLIT_ON_CASE_CHANGE
public static final int SPLIT_ON_NUMERICS
public static final int STEM_ENGLISH_POSSESSIVE
| Constructor Detail |
|---|
public WordDelimiterFilter(TokenStream in,
byte[] charTypeTable,
int configurationFlags,
CharArraySet protWords)
Parameters:
in - TokenStream to be filtered
charTypeTable - table containing character types
configurationFlags - Flags configuring the filter
protWords - If not null is the set of tokens to protect from being delimited
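The effect of protWords can be illustrated with a small standalone sketch: a protected token passes through unchanged, and anything else is split. The class `ProtectedWordsSketch` is hypothetical, a plain `Set<String>` with exact matching stands in for CharArraySet, and splitting is reduced to breaking on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the protected-words check; not Lucene code.
final class ProtectedWordsSketch {

    // Returns the token untouched when it is protected; otherwise splits it on
    // runs of non-alphanumeric characters (a crude stand-in for the full rules).
    static List<String> process(String token, Set<String> protWords) {
        if (protWords != null && protWords.contains(token)) {
            return Collections.singletonList(token);
        }
        List<String> parts = new ArrayList<>();
        for (String p : token.split("[^A-Za-z0-9]+")) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> protWords = new HashSet<>();
        protWords.add("Wi-Fi");
        System.out.println(process("Wi-Fi", protWords)); // [Wi-Fi]
        System.out.println(process("Wi-Fi", null));      // [Wi, Fi]
    }
}
```

Passing null mirrors the documented contract that the protected set is optional.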
public WordDelimiterFilter(TokenStream in,
int configurationFlags,
CharArraySet protWords)
Creates a new WordDelimiterFilter using WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE as its charTypeTable
Parameters:
in - TokenStream to be filtered
configurationFlags - Flags configuring the filter
protWords - If not null is the set of tokens to protect from being delimited
| Method Detail |
|---|
public boolean incrementToken()
throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset()
throws IOException
Overrides: reset in class TokenFilter
Throws: IOException