org.apache.lucene.search.spell
Class DirectSpellChecker

java.lang.Object
  extended by org.apache.lucene.search.spell.DirectSpellChecker

public class DirectSpellChecker
extends Object

Simple automaton-based spellchecker.

Candidates are presented directly from the term dictionary, based on Levenshtein distance. This is an alternative to SpellChecker if you are using an edit-distance-like metric such as Levenshtein or JaroWinklerDistance.

A practical benefit of this spellchecker is that it requires no additional data structures (neither in RAM nor on disk) to do its work.

See Also:
LevenshteinAutomata, FuzzyTermsEnum
WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
protected static class DirectSpellChecker.ScoreTerm
          Holds a spelling correction for internal usage inside DirectSpellChecker.
 
Field Summary
static StringDistance INTERNAL_LEVENSHTEIN
          The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata.
 
Constructor Summary
DirectSpellChecker()
          Creates a DirectSpellChecker with default configuration values
 
Method Summary
 float getAccuracy()
          Get the minimal accuracy from the StringDistance for a match
 Comparator<SuggestWord> getComparator()
          Get the current comparator in use.
 StringDistance getDistance()
          Get the string distance metric in use.
 boolean getLowerCaseTerms()
          true if the spellchecker should lowercase terms
 int getMaxEdits()
          Get the maximum Levenshtein edit distance within which candidate terms are drawn.
 int getMaxInspections()
          Get the maximum number of top-N inspections per suggestion
 float getMaxQueryFrequency()
          Get the maximum threshold of documents a query term can appear in and still receive suggestions.
 int getMinPrefix()
          Get the minimal number of characters that must match exactly
 int getMinQueryLength()
          Get the minimum length of a query term needed to return suggestions
 float getThresholdFrequency()
          Get the minimal threshold of documents a term must appear in to be suggested
 void setAccuracy(float accuracy)
          Set the minimal accuracy required (default: 0.5f) from a StringDistance for a suggestion match.
 void setComparator(Comparator<SuggestWord> comparator)
          Set the comparator for sorting suggestions.
 void setDistance(StringDistance distance)
          Set the string distance metric.
 void setLowerCaseTerms(boolean lowerCaseTerms)
          True if the spellchecker should lowercase terms (default: true)
 void setMaxEdits(int maxEdits)
          Sets the maximum Levenshtein edit distance within which candidate terms are drawn.
 void setMaxInspections(int maxInspections)
          Set the maximum number of top-N inspections (default: 5) per suggestion.
 void setMaxQueryFrequency(float maxQueryFrequency)
          Set the maximum threshold (default: 0.01f) of documents a query term can appear in and still receive suggestions.
 void setMinPrefix(int minPrefix)
          Sets the minimal number of initial characters (default: 1) that must match exactly.
 void setMinQueryLength(int minQueryLength)
          Set the minimum length of a query term (default: 4) needed to return suggestions.
 void setThresholdFrequency(float thresholdFrequency)
          Set the minimal threshold of documents a term must appear in to be suggested.
 SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir)
          Calls suggestSimilar(term, numSug, ir, SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX)
protected  Collection<DirectSpellChecker.ScoreTerm> suggestSimilar(Term term, int numSug, IndexReader ir, int docfreq, int editDistance, float accuracy, CharsRef spare)
          Provide spelling corrections based on several parameters.
 SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode)
          Calls suggestSimilar(term, numSug, ir, suggestMode, this.accuracy)
 SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode, float accuracy)
          Suggest similar words.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INTERNAL_LEVENSHTEIN

public static final StringDistance INTERNAL_LEVENSHTEIN
The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata.

Note: this is the fastest distance metric, because Damerau-Levenshtein is used to draw candidates from the term dictionary: this just re-uses the scoring.
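For illustration, the kind of distance this metric is based on can be sketched with a textbook dynamic-programming routine. This is a minimal stand-alone sketch (the "optimal string alignment" variant of Damerau-Levenshtein, operating on UTF-16 chars), not the automaton-based code Lucene uses; the class and method names are hypothetical.

```java
final class EditDistanceSketch {
    /**
     * Counts insertions, deletions, substitutions and swaps of two
     * adjacent characters, each as a single edit.
     */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "en" <-> "ne".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucnee")); // one transposition
        System.out.println(distance("lucene", "lucine")); // one substitution
    }
}
```

Because a transposition counts as one edit rather than two, a swapped-letter typo such as "lucnee" stays within a maximum edit distance of 1.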

Constructor Detail

DirectSpellChecker

public DirectSpellChecker()
Creates a DirectSpellChecker with default configuration values

Method Detail

getMaxEdits

public int getMaxEdits()
Get the maximum Levenshtein edit distance within which candidate terms are drawn.


setMaxEdits

public void setMaxEdits(int maxEdits)
Sets the maximum Levenshtein edit distance within which candidate terms are drawn. This value can be 1 or 2. The default is 2.

Note: a large number of spelling errors occur with an edit distance of 1. Setting this value to 1 can increase both performance and precision, at the cost of recall.


getMinPrefix

public int getMinPrefix()
Get the minimal number of characters that must match exactly


setMinPrefix

public void setMinPrefix(int minPrefix)
Sets the minimal number of initial characters (default: 1) that must match exactly.

This can improve both the performance and the accuracy of results, because misspellings rarely occur in the first character.
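The effect of a required prefix can be pictured as a cheap filter applied before any edit-distance work. The following is a minimal sketch under stated assumptions, not Lucene's implementation: the class and method names are hypothetical, and treating terms shorter than the prefix as non-matching is an assumption of this example.

```java
final class PrefixFilterSketch {
    /**
     * Returns true if the candidate shares its first minPrefix
     * characters with the query term.
     */
    static boolean sharesPrefix(String query, String candidate, int minPrefix) {
        // Assumption: terms shorter than the required prefix never match.
        if (query.length() < minPrefix || candidate.length() < minPrefix) {
            return false;
        }
        return query.regionMatches(0, candidate, 0, minPrefix);
    }

    public static void main(String[] args) {
        System.out.println(sharesPrefix("lucene", "lucine", 1)); // same first letter
        System.out.println(sharesPrefix("lucene", "mucene", 1)); // first letter differs
    }
}
```

With a prefix of 1, "mucene" is never considered as a correction for "lucene", even though it is only one edit away.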


getMaxInspections

public int getMaxInspections()
Get the maximum number of top-N inspections per suggestion


setMaxInspections

public void setMaxInspections(int maxInspections)
Set the maximum number of top-N inspections (default: 5) per suggestion.

Increasing this number can improve the accuracy of results, at the cost of performance.


getAccuracy

public float getAccuracy()
Get the minimal accuracy from the StringDistance for a match


setAccuracy

public void setAccuracy(float accuracy)
Set the minimal accuracy required (default: 0.5f) from a StringDistance for a suggestion match.


getThresholdFrequency

public float getThresholdFrequency()
Get the minimal threshold of documents a term must appear in to be suggested


setThresholdFrequency

public void setThresholdFrequency(float thresholdFrequency)
Set the minimal threshold of documents a term must appear in to be suggested.

This can improve quality by only suggesting high-frequency terms. Note that very high values might decrease performance slightly, by forcing the spellchecker to draw more candidates from the term dictionary, but a practical value such as 1 can be very useful towards improving quality.

This can be specified as a fraction of the documents in the index, such as 0.5f (half of all documents), or as an absolute whole-number document frequency, such as 4f. Absolute document frequencies may not be fractional.
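One way to read the two interpretations is sketched below. This is an illustrative sketch only: the class and method names are hypothetical, maxDoc stands for the number of documents in the index, and the rule that values of 1 or more are absolute, as well as the rounding of fractional thresholds, are assumptions of this example rather than documented behavior.

```java
final class FrequencyThresholdSketch {
    /**
     * Resolves a threshold to a minimum document count for an index
     * holding maxDoc documents.
     */
    static int minDocFreq(float threshold, int maxDoc) {
        if (threshold >= 1f) {
            // Absolute document frequency: must be a whole number.
            if (threshold != (int) threshold) {
                throw new IllegalArgumentException(
                        "absolute document frequencies must be whole numbers");
            }
            return (int) threshold;
        }
        // Assumption: a fraction of the index, rounded up.
        return (int) Math.ceil(threshold * maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(minDocFreq(4f, 1000));   // absolute threshold
        System.out.println(minDocFreq(0.5f, 1000)); // half of the index
    }
}
```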


getMinQueryLength

public int getMinQueryLength()
Get the minimum length of a query term needed to return suggestions


setMinQueryLength

public void setMinQueryLength(int minQueryLength)
Set the minimum length of a query term (default: 4) needed to return suggestions.

Very short query terms often produce only bad suggestions, whatever the distance metric.


getMaxQueryFrequency

public float getMaxQueryFrequency()
Get the maximum threshold of documents a query term can appear in and still receive suggestions.


setMaxQueryFrequency

public void setMaxQueryFrequency(float maxQueryFrequency)
Set the maximum threshold (default: 0.01f) of documents a query term can appear in and still receive suggestions.

Very high-frequency terms are typically spelled correctly. Additionally, this can increase performance as it will do no work for the common case of correctly-spelled input terms.

This can be specified as a fraction of the documents in the index, such as 0.5f (half of all documents), or as an absolute whole-number document frequency, such as 4f. Absolute document frequencies may not be fractional.
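The early exit for frequent query terms might look roughly like the following. This is a hedged sketch, not the library's code: the names are hypothetical, and both the "values of 1 or more are absolute" rule and the rounding of the fractional case are assumptions made for the example.

```java
final class QueryFrequencySketch {
    /**
     * Decides whether a query term is rare enough to be worth correcting.
     * queryDocFreq is the number of documents containing the query term.
     */
    static boolean shouldSuggest(int queryDocFreq, float maxQueryFrequency, int maxDoc) {
        // Assumption: values >= 1 are absolute counts, smaller values are
        // a fraction of the index, rounded up.
        int limit = maxQueryFrequency >= 1f
                ? (int) maxQueryFrequency
                : (int) Math.ceil(maxQueryFrequency * maxDoc);
        return queryDocFreq <= limit;
    }

    public static void main(String[] args) {
        // A term found in 300 of 1000 documents is treated as correctly spelled.
        System.out.println(shouldSuggest(300, 0.25f, 1000));
        // A term missing from the index is a candidate for correction.
        System.out.println(shouldSuggest(0, 0.25f, 1000));
    }
}
```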


getLowerCaseTerms

public boolean getLowerCaseTerms()
true if the spellchecker should lowercase terms


setLowerCaseTerms

public void setLowerCaseTerms(boolean lowerCaseTerms)
True if the spellchecker should lowercase terms (default: true)

This is a convenience method. If your index field has more complicated analysis (such as StandardTokenizer removing punctuation), it's probably better to turn this off and instead run your query terms through your Analyzer first.

If this option is not on, case differences count as an edit!
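The cost of a case difference can be seen with a small stand-alone example. This is a sketch under stated assumptions: the names are hypothetical, only substitutions between equal-length strings are counted, and lowercasing with Locale.ROOT is a choice made for this example.

```java
import java.util.Locale;

final class CaseEditSketch {
    /** Counts the positions at which two equal-length strings differ. */
    static int substitutions(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("lengths differ");
        }
        int edits = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                edits++;
            }
        }
        return edits;
    }

    /** Lowercases the term only when the option is switched on. */
    static String normalize(String term, boolean lowerCaseTerms) {
        return lowerCaseTerms ? term.toLowerCase(Locale.ROOT) : term;
    }

    public static void main(String[] args) {
        // Option off: 'L' vs 'l' uses up one edit.
        System.out.println(substitutions(normalize("Lucene", false), "lucene"));
        // Option on: the terms are identical.
        System.out.println(substitutions(normalize("Lucene", true), "lucene"));
    }
}
```

With a maximum edit distance of 1, a capitalized query that also contains a real typo would find no candidates unless the query is lowercased first.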


getComparator

public Comparator<SuggestWord> getComparator()
Get the current comparator in use.


setComparator

public void setComparator(Comparator<SuggestWord> comparator)
Set the comparator for sorting suggestions. The default is SuggestWordQueue.DEFAULT_COMPARATOR


getDistance

public StringDistance getDistance()
Get the string distance metric in use.


setDistance

public void setDistance(StringDistance distance)
Set the string distance metric. The default is INTERNAL_LEVENSHTEIN

Note: because this spellchecker draws its candidates from the term dictionary using Damerau-Levenshtein, it works best with an edit-distance-like string metric. If you use a different metric than the default, you might want to consider increasing setMaxInspections(int) to draw more candidates for your metric to rank.
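The reason a larger candidate pool helps a non-default metric can be sketched as a two-stage process: candidates are first drawn by edit distance, then re-ranked by the chosen metric, and only the top few are kept. The code below is an illustrative sketch only; the class, the methods, and the toy shared-prefix metric are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class RerankSketch {
    /** Toy similarity used for the example: length of the shared prefix. */
    static double commonPrefix(String a, String b) {
        int n = 0;
        while (n < a.length() && n < b.length() && a.charAt(n) == b.charAt(n)) {
            n++;
        }
        return n;
    }

    /** Re-ranks an already drawn candidate pool and keeps the best numSug. */
    static List<String> topN(String query, List<String> pool, int numSug,
                             ToDoubleBiFunction<String, String> metric) {
        List<String> sorted = new ArrayList<>(pool);
        // Highest similarity first.
        sorted.sort(Comparator
                .comparingDouble((String c) -> metric.applyAsDouble(query, c))
                .reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(numSug, sorted.size())));
    }

    public static void main(String[] args) {
        List<String> pool = new ArrayList<>();
        pool.add("lucid");
        pool.add("lucent");
        pool.add("apple");
        System.out.println(topN("lucene", pool, 2, RerankSketch::commonPrefix));
    }
}
```

If the pool is too small, the candidate that the custom metric would have ranked first may never be drawn, which is why increasing the number of inspections can matter.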


suggestSimilar

public SuggestWord[] suggestSimilar(Term term,
                                    int numSug,
                                    IndexReader ir)
                             throws IOException
Calls suggestSimilar(term, numSug, ir, SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX)

Throws:
IOException

suggestSimilar

public SuggestWord[] suggestSimilar(Term term,
                                    int numSug,
                                    IndexReader ir,
                                    SuggestMode suggestMode)
                             throws IOException
Calls suggestSimilar(term, numSug, ir, suggestMode, this.accuracy)

Throws:
IOException

suggestSimilar

public SuggestWord[] suggestSimilar(Term term,
                                    int numSug,
                                    IndexReader ir,
                                    SuggestMode suggestMode,
                                    float accuracy)
                             throws IOException
Suggest similar words.

Unlike SpellChecker, this method fetches the most relevant terms using an edit distance as the similarity, so a low value for numSug typically works very well.

Parameters:
term - Term you want to spell check on
numSug - the maximum number of suggested words
ir - IndexReader to find terms from
suggestMode - specifies when to return suggested words
accuracy - return only suggested words that match with this similarity
Returns:
sorted list of the suggested words according to the comparator
Throws:
IOException - If there is a low-level I/O error.

suggestSimilar

protected Collection<DirectSpellChecker.ScoreTerm> suggestSimilar(Term term,
                                                                  int numSug,
                                                                  IndexReader ir,
                                                                  int docfreq,
                                                                  int editDistance,
                                                                  float accuracy,
                                                                  CharsRef spare)
                                                           throws IOException
Provide spelling corrections based on several parameters.

Parameters:
term - The term to suggest spelling corrections for
numSug - The maximum number of spelling corrections
ir - The index reader to fetch the candidate spelling corrections from
docfreq - The minimum document frequency a potential suggestion needs to have in order to be included
editDistance - The maximum edit distance candidates are allowed to have
accuracy - The minimum accuracy a suggested spelling correction needs to have in order to be included
spare - a chars scratch
Returns:
a collection of spelling corrections sorted by ScoreTerm's natural order.
Throws:
IOException - If I/O related errors occur


Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.