public class DirectSpellChecker extends Object
Candidates are presented directly from the term dictionary, based on Levenshtein distance. This is an alternative to SpellChecker if you are using an edit-distance-like metric such as Levenshtein or JaroWinklerDistance.
A practical benefit of this spellchecker is that it requires no additional datastructures (neither in RAM nor on disk) to do its work.
See Also: LevenshteinAutomata, FuzzyTermsEnum
Modifier and Type | Class and Description |
---|---|
protected static class | DirectSpellChecker.ScoreTerm - Holds a spelling correction for internal usage inside DirectSpellChecker. |
Modifier and Type | Field and Description |
---|---|
static StringDistance | INTERNAL_LEVENSHTEIN - The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata. |
Constructor and Description |
---|
DirectSpellChecker() - Creates a DirectSpellChecker with default configuration values |
Modifier and Type | Method and Description |
---|---|
float | getAccuracy() - Get the minimal accuracy from the StringDistance for a match |
Comparator<SuggestWord> | getComparator() - Get the current comparator in use. |
StringDistance | getDistance() - Get the string distance metric in use. |
boolean | getLowerCaseTerms() - true if the spellchecker should lowercase terms |
int | getMaxEdits() - Get the maximum number of Levenshtein edit-distances to draw candidate terms from. |
int | getMaxInspections() - Get the maximum number of top-N inspections per suggestion |
float | getMaxQueryFrequency() - Get the maximum threshold of documents a query term can appear in order to provide suggestions. |
int | getMinPrefix() - Get the minimal number of characters that must match exactly |
int | getMinQueryLength() - Get the minimum length of a query term needed to return suggestions |
float | getThresholdFrequency() - Get the minimal threshold of documents a term must appear for a match |
void | setAccuracy(float accuracy) - Set the minimal accuracy required (default: 0.5f) from a StringDistance for a suggestion match. |
void | setComparator(Comparator<SuggestWord> comparator) - Set the comparator for sorting suggestions. |
void | setDistance(StringDistance distance) - Set the string distance metric. |
void | setLowerCaseTerms(boolean lowerCaseTerms) - True if the spellchecker should lowercase terms (default: true) |
void | setMaxEdits(int maxEdits) - Sets the maximum number of Levenshtein edit-distances to draw candidate terms from. |
void | setMaxInspections(int maxInspections) - Set the maximum number of top-N inspections (default: 5) per suggestion. |
void | setMaxQueryFrequency(float maxQueryFrequency) - Set the maximum threshold (default: 0.01f) of documents a query term can appear in order to provide suggestions. |
void | setMinPrefix(int minPrefix) - Sets the minimal number of initial characters (default: 1) that must match exactly. |
void | setMinQueryLength(int minQueryLength) - Set the minimum length of a query term (default: 4) needed to return suggestions. |
void | setThresholdFrequency(float thresholdFrequency) - Set the minimal threshold of documents a term must appear for a match. |
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir) |
protected Collection<DirectSpellChecker.ScoreTerm> | suggestSimilar(Term term, int numSug, IndexReader ir, int docfreq, int editDistance, float accuracy, CharsRefBuilder spare) - Provide spelling corrections based on several parameters. |
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode) |
SuggestWord[] | suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode, float accuracy) - Suggest similar words. |
public static final StringDistance INTERNAL_LEVENSHTEIN
The default StringDistance, Damerau-Levenshtein distance implemented internally via LevenshteinAutomata.
Note: this is the fastest distance metric. Because Damerau-Levenshtein is already used to draw candidates from the term dictionary, this metric simply re-uses that scoring.
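For readers unfamiliar with the metric, the following is a minimal sketch of a Damerau-Levenshtein style distance (the "optimal string alignment" variant), which counts insertions, deletions, substitutions, and swaps of adjacent characters. It uses plain dynamic programming and is not the automaton-based implementation referred to above; the class name EditDistanceSketch and its method are hypothetical and not part of Lucene.

```java
// Illustrative only: a textbook dynamic-programming distance, NOT Lucene's
// LevenshteinAutomata-based implementation. Names here are hypothetical.
final class EditDistanceSketch {

    /** Returns the optimal-string-alignment distance between a and b. */
    static int distance(String a, String b) {
        int n = a.length();
        int m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= m; j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); // delete, insert
                best = Math.min(best, d[i - 1][j - 1] + cost);         // substitute or match
                // Swap of two adjacent characters counts as a single edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucnee"));  // 1: "en" and "ne" are swapped
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

With a plain Levenshtein distance, which has no transposition edit, the first pair would cost 2 instead of 1.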
public DirectSpellChecker()
public int getMaxEdits()
public void setMaxEdits(int maxEdits)
Note: a large number of spelling errors occur with an edit distance of 1. By setting this value to 1 you can increase both performance and precision, at the cost of recall.
public int getMinPrefix()
public void setMinPrefix(int minPrefix)
This can improve both performance and accuracy of results, as misspellings commonly do not occur in the first character.
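As an illustration of the prefix constraint, here is a minimal sketch of the check it implies: a candidate is only considered if its first minPrefix characters equal those of the query term. The class PrefixFilterSketch and its method are hypothetical, and comparing by Java char (rather than by code point) is an assumption of this sketch.

```java
// Illustrative only: hypothetical helper, not part of Lucene.
final class PrefixFilterSketch {

    /** True if query and candidate agree on their first minPrefix chars. */
    static boolean sharesPrefix(String query, String candidate, int minPrefix) {
        if (minPrefix < 0) {
            throw new IllegalArgumentException("minPrefix must be >= 0");
        }
        if (query.length() < minPrefix || candidate.length() < minPrefix) {
            return false; // too short to satisfy the required exact prefix
        }
        return query.regionMatches(0, candidate, 0, minPrefix);
    }

    public static void main(String[] args) {
        System.out.println(sharesPrefix("lucene", "lucnee", 1)); // true
        System.out.println(sharesPrefix("lucene", "mucene", 1)); // false
        System.out.println(sharesPrefix("lucene", "mucene", 0)); // true
    }
}
```

A larger required prefix prunes more of the term dictionary up front, which is where the performance gain comes from.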
public int getMaxInspections()
public void setMaxInspections(int maxInspections)
Increasing this number can improve the accuracy of results, at the cost of performance.
public float getAccuracy()
public void setAccuracy(float accuracy)
public float getThresholdFrequency()
public void setThresholdFrequency(float thresholdFrequency)
This can improve quality by only suggesting high-frequency terms. Note that very high values might decrease performance slightly, by forcing the spellchecker to draw more candidates from the term dictionary, but a practical value such as 1 can be very useful towards improving quality.
This can be specified as a relative percentage of documents such as 0.5f, or it can be specified as an absolute whole document frequency, such as 4f. Absolute document frequencies may not be fractional.
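The two ways of specifying the threshold can be sketched as follows. This is an assumption-laden illustration, not Lucene's code: it treats values of 1 or more as absolute document counts (rejecting fractional ones) and values below 1 as a fraction of the total number of documents, rounded up. The class FrequencyThresholdSketch, the method minDocFreq, and the rounding choice are all hypothetical.

```java
// Illustrative only: hypothetical helper showing one way to interpret the
// threshold. The rounding behavior is an assumption, not Lucene's.
final class FrequencyThresholdSketch {

    /** Converts a threshold setting into a minimum document count. */
    static int minDocFreq(float threshold, int maxDoc) {
        if (threshold < 0f) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        if (threshold >= 1f) {
            // Absolute document frequency: must be a whole number.
            if (threshold != (int) threshold) {
                throw new IllegalArgumentException(
                    "Fractional absolute document frequencies are not allowed");
            }
            return (int) threshold;
        }
        // Relative: a fraction of all documents in the index.
        return (int) Math.ceil(threshold * maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(minDocFreq(4f, 1000));   // 4 documents
        System.out.println(minDocFreq(0.5f, 1000)); // 500 documents
    }
}
```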
public int getMinQueryLength()
public void setMinQueryLength(int minQueryLength)
Very short query terms will often cause only bad suggestions with any distance metric.
public float getMaxQueryFrequency()
public void setMaxQueryFrequency(float maxQueryFrequency)
Very high-frequency terms are typically spelled correctly. Additionally, this can increase performance as it will do no work for the common case of correctly-spelled input terms.
This can be specified as a relative percentage of documents such as 0.5f, or it can be specified as an absolute whole document frequency, such as 4f. Absolute document frequencies may not be fractional.
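The early exit described above can be sketched as a single check on the query term's document frequency. As before, this is a hedged illustration: the class QueryFrequencySketch and the method isTooFrequent are hypothetical, and treating values of 1 or more as absolute counts and smaller values as a rounded-up fraction of all documents is an assumption of this sketch.

```java
// Illustrative only: hypothetical helper, not part of Lucene.
final class QueryFrequencySketch {

    /** True if the query term is frequent enough to be assumed correctly spelled. */
    static boolean isTooFrequent(int queryDocFreq, float maxQueryFrequency, int maxDoc) {
        if (maxQueryFrequency >= 1f) {
            // Absolute document frequency.
            return queryDocFreq > (int) maxQueryFrequency;
        }
        // Relative: a fraction of all documents in the index.
        return queryDocFreq > (int) Math.ceil(maxQueryFrequency * maxDoc);
    }

    public static void main(String[] args) {
        // A term found in 600 of 1000 documents exceeds a 0.5f limit,
        // so no suggestions would be computed for it.
        System.out.println(isTooFrequent(600, 0.5f, 1000)); // true
        System.out.println(isTooFrequent(4, 4f, 1000));     // false
    }
}
```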
public boolean getLowerCaseTerms()
public void setLowerCaseTerms(boolean lowerCaseTerms)
This is a convenience method. If your index field has more complicated analysis (such as StandardTokenizer removing punctuation), it is probably better to turn this off and instead run your query terms through your Analyzer first.
If this option is off, case differences count as an edit!
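To make the last point concrete, the sketch below counts character mismatches between a query term and an indexed term, with and without lowercasing the query first. The class CaseNormalizationSketch and its methods are hypothetical, and lowercasing with Locale.ROOT is an assumption of this sketch rather than a statement about Lucene's behavior.

```java
import java.util.Locale;

// Illustrative only: hypothetical helpers, not part of Lucene.
final class CaseNormalizationSketch {

    /** Lowercases the query term when the option is on (Locale.ROOT is an assumption). */
    static String normalize(String queryTerm, boolean lowerCaseTerms) {
        return lowerCaseTerms ? queryTerm.toLowerCase(Locale.ROOT) : queryTerm;
    }

    /** Counts positions at which two equal-length strings differ (substitutions only). */
    static int mismatches(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String indexed = "lucene";
        System.out.println(mismatches(normalize("Lucene", false), indexed)); // 1
        System.out.println(mismatches(normalize("Lucene", true), indexed));  // 0
    }
}
```

With the option off and a maximum of one edit, the capital "L" alone would use up the whole edit budget.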
public Comparator<SuggestWord> getComparator()
public void setComparator(Comparator<SuggestWord> comparator)
The default is SuggestWordQueue.DEFAULT_COMPARATOR.
public StringDistance getDistance()
public void setDistance(StringDistance distance)
The default is INTERNAL_LEVENSHTEIN.
Note: because this spellchecker draws its candidates from the term dictionary using Damerau-Levenshtein, it works best with an edit-distance-like string metric. If you use a metric other than the default, consider increasing setMaxInspections(int) to draw more candidates for your metric to rank.
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir) throws IOException
Throws: IOException
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode) throws IOException
Throws: IOException
public SuggestWord[] suggestSimilar(Term term, int numSug, IndexReader ir, SuggestMode suggestMode, float accuracy) throws IOException
Unlike SpellChecker, the similarity used to fetch the most relevant terms is an edit distance, therefore typically a low value for numSug will work very well.
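The idea that a small numSug suffices when candidates are already ranked by edit distance can be sketched over an in-memory word list. This is a toy model under stated assumptions, not Lucene's algorithm: it uses a plain Levenshtein distance without transpositions, scans a List instead of a term dictionary, and breaks ties alphabetically. The class TopSuggestionsSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a toy in-memory model, not Lucene's implementation.
final class TopSuggestionsSketch {

    /** Plain Levenshtein distance (no transpositions), using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns up to numSug terms within maxEdits of query, closest first. */
    static List<String> suggest(String query, List<String> dictionary, int maxEdits, int numSug) {
        List<String> candidates = new ArrayList<>();
        for (String term : dictionary) {
            int d = levenshtein(query, term);
            if (d > 0 && d <= maxEdits) { // skip the query itself and distant terms
                candidates.add(term);
            }
        }
        // Closest first; ties broken alphabetically (an arbitrary choice for this sketch).
        candidates.sort(Comparator.comparingInt((String t) -> levenshtein(query, t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(candidates.subList(0, Math.min(numSug, candidates.size())));
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("lucent", "lucerne", "luke", "solr", "lucene");
        System.out.println(suggest("lucene", dictionary, 2, 2)); // [lucent, lucerne]
    }
}
```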
Parameters:
term - Term you want to spell check on
numSug - the maximum number of suggested words
ir - IndexReader to find terms from
suggestMode - specifies when to return suggested words
accuracy - return only suggested words that match with this similarity
Throws:
IOException - If there is a low-level I/O error.
protected Collection<DirectSpellChecker.ScoreTerm> suggestSimilar(Term term, int numSug, IndexReader ir, int docfreq, int editDistance, float accuracy, CharsRefBuilder spare) throws IOException
Parameters:
term - The term to suggest spelling corrections for
numSug - The maximum number of spelling corrections
ir - The index reader to fetch the candidate spelling corrections from
docfreq - The minimum document frequency a potential suggestion needs to have in order to be included
editDistance - The maximum edit distance candidates are allowed to have
accuracy - The minimum accuracy a suggested spelling correction needs to have in order to be included
spare - a chars scratch
Returns:
a collection of spelling corrections sorted by ScoreTerm's natural order.
Throws:
IOException - If I/O related errors occur
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.