Class NGramFragmentChecker
- java.lang.Object
  - org.apache.lucene.analysis.hunspell.NGramFragmentChecker
- All Implemented Interfaces: FragmentChecker
public class NGramFragmentChecker extends Object implements FragmentChecker
A FragmentChecker based on all character n-grams possible in a certain language. The n-grams are kept in a relatively memory-efficient but probabilistic data structure. The n-gram length should be 2, 3, or 4.
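To illustrate what "memory-efficient but probabilistic" can mean in practice, here is a minimal sketch of a hashed bit set for n-grams. The class `HashedNGramSet`, its hash function, and its sizing are assumptions made up for this illustration and are not taken from the Lucene implementation. The point is only that such a structure may report an unseen n-gram as present (a hash collision), but never reports an added n-gram as absent.

```java
import java.util.BitSet;

/** Hypothetical illustration only: one bit per hash slot, no stored strings. */
class HashedNGramSet {
    private final BitSet bits;
    private final int size;

    HashedNGramSet(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    /** Maps the character range [start, end) to a slot; the hash is an arbitrary choice. */
    private int slot(CharSequence s, int start, int end) {
        int h = 0;
        for (int i = start; i < end; i++) {
            h = 31 * h + s.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    void add(CharSequence s, int start, int end) {
        bits.set(slot(s, start, end));
    }

    /** True for every added n-gram; may also be true for others that collide. */
    boolean mightContain(CharSequence s, int start, int end) {
        return bits.get(slot(s, start, end));
    }

    public static void main(String[] args) {
        HashedNGramSet set = new HashedNGramSet(1 << 16);
        set.add("abc", 0, 3);
        System.out.println(set.mightContain("abc", 0, 3)); // prints "true"
    }
}
```

A smaller `size` saves memory at the cost of more collisions, which in a fragment checker would mean fewer impossible fragments being detected rather than valid words being rejected.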
Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static interface | NGramFragmentChecker.NGramConsumer | A callback for n-gram ranges in words |
Field Summary
Fields inherited from interface org.apache.lucene.analysis.hunspell.FragmentChecker: EVERYTHING_POSSIBLE
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static NGramFragmentChecker | fromAllSimpleWords(int n, Dictionary dictionary, Runnable checkCanceled) | Iterate the whole dictionary, derive all word forms (using WordFormGenerator), vary the case to get all words acceptable by the spellchecker, and create a fragment checker based on their n-grams. |
| static NGramFragmentChecker | fromWords(int n, Collection<? extends CharSequence> words) | Create a fragment checker for n-grams found in the given words. |
| boolean | hasImpossibleFragmentAround(CharSequence word, int start, int end) | Check if the given word range intersects any fragment which is impossible in the current language. |
| static void | processNGrams(int n, Dictionary dictionary, Runnable checkCanceled, NGramFragmentChecker.NGramConsumer consumer) | Traverse the whole dictionary, generate all word forms of its entries, and process all n-grams in these word forms. |

Method Detail
-
hasImpossibleFragmentAround
public boolean hasImpossibleFragmentAround(CharSequence word, int start, int end)
Description copied from interface: FragmentChecker
Check if the given word range intersects any fragment which is impossible in the current language. For example, if the word is "aaax", and there are no "aaa" combinations in words accepted by the spellchecker (but "aax" is valid), then true can be returned for all ranges in 0..3, but not for 3..4.
The implementation must be monotonic: if some range is considered impossible, larger ranges encompassing it should also produce true.
- Specified by: hasImpossibleFragmentAround in interface FragmentChecker
- Parameters:
  - word - the whole word being checked for impossible substrings
  - start - the start of the range in question, inclusive
  - end - the end of the range in question, inclusive, not smaller than start
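The "aaax" example and the monotonicity requirement can be sketched with a simplified stand-in. The class `ExactNGramChecker` below is hypothetical: it stores n-grams in an exact `HashSet` rather than a probabilistic structure, and it treats the range as half-open, `[start, end)`, which is an assumption chosen so that the range 3..4 in a four-character word is meaningful. It is not the Lucene implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in using an exact set of known n-grams. */
class ExactNGramChecker {
    private final int n;
    private final Set<String> known = new HashSet<>();

    public ExactNGramChecker(int n, Iterable<? extends CharSequence> words) {
        this.n = n;
        for (CharSequence w : words) {
            String s = w.toString();
            for (int i = 0; i + n <= s.length(); i++) {
                known.add(s.substring(i, i + n));
            }
        }
    }

    /** Assumption: the range is half-open, [start, end). */
    public boolean hasImpossibleFragmentAround(CharSequence word, int start, int end) {
        if (word.length() < n) {
            return false;
        }
        // Check every n-gram that overlaps at least one character of the range.
        int first = Math.max(0, start - n + 1);
        int last = Math.min(end - 1, word.length() - n);
        for (int i = first; i <= last; i++) {
            if (!known.contains(word.subSequence(i, i + n).toString())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Only "aax" is a known trigram; "aaa" is not.
        ExactNGramChecker checker = new ExactNGramChecker(3, List.of("aax"));
        System.out.println(checker.hasImpossibleFragmentAround("aaax", 0, 3)); // prints "true"
        System.out.println(checker.hasImpossibleFragmentAround("aaax", 3, 4)); // prints "false"
    }
}
```

Because a larger range can only add n-grams to the set being checked, widening a range that already returned true cannot turn the result into false, which is the monotonicity property in this sketch.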
-
fromAllSimpleWords
public static NGramFragmentChecker fromAllSimpleWords(int n, Dictionary dictionary, Runnable checkCanceled)
Iterate the whole dictionary, derive all word forms (using WordFormGenerator), vary the case to get all words acceptable by the spellchecker, and create a fragment checker based on their n-grams. Note that this enumerates only words derivable by suffixes and prefixes. If the language has compounds, some n-grams possible via those compounds can be missed. In the latter case, consider using fromWords(int, java.util.Collection<? extends java.lang.CharSequence>).
- Parameters:
  - n - the length of n-grams
  - dictionary - the dictionary to traverse
  - checkCanceled - an object that's called periodically, allowing the traversal to be interrupted by throwing an exception
-
fromWords
public static NGramFragmentChecker fromWords(int n, Collection<? extends CharSequence> words)
Create a fragment checker for n-grams found in the given words. The words can be n-grams themselves or full words of the language. The words are case-sensitive, so be sure to include upper-case and title-case variants if they're accepted by the spellchecker.
- Parameters:
  - n - the length of the n-grams to consider
  - words - the strings to extract n-grams from
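Since the words are case-sensitive, a caller may want to expand a word list with case variants before building the checker. The helper below is a hypothetical sketch (the class `CaseVariants` and the method `withCaseVariants` are not part of Lucene) and assumes that locale-independent casing via `Locale.ROOT` is acceptable; some languages need locale-specific rules.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that adds upper-case and title-case variants of each word. */
class CaseVariants {
    public static Set<String> withCaseVariants(Collection<String> words) {
        Set<String> result = new LinkedHashSet<>();
        for (String w : words) {
            result.add(w);
            if (w.isEmpty()) {
                continue;
            }
            result.add(w.toUpperCase(Locale.ROOT));
            // Upper-case only the first code point to form the title-case variant.
            int firstLen = w.offsetByCodePoints(0, 1);
            result.add(w.substring(0, firstLen).toUpperCase(Locale.ROOT) + w.substring(firstLen));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(withCaseVariants(List.of("word"))); // prints "[word, WORD, Word]"
    }
}
```

The resulting collection could then be passed as the `words` argument of `fromWords(n, words)`, provided those variants are in fact accepted by the spellchecker.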
-
processNGrams
public static void processNGrams(int n, Dictionary dictionary, Runnable checkCanceled, NGramFragmentChecker.NGramConsumer consumer)
Traverse the whole dictionary, generate all word forms of its entries, and process all n-grams in these word forms. No duplicate removal is done, so the consumer should be prepared to receive duplicate n-grams. The traversal order is undefined.
- Parameters:
  - n - the length of the n-grams
  - dictionary - the dictionary to traverse
  - checkCanceled - an object that's called periodically, allowing the traversal to be interrupted by throwing an exception
  - consumer - the n-gram consumer to be called for each n-gram
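The contract above (duplicates are delivered, cancellation happens by throwing) can be sketched without a real dictionary. Everything in this block is hypothetical: the `RangeConsumer` interface is only modelled on the description "a callback for n-gram ranges in words" and does not reproduce the actual NGramConsumer signature, and `forEachNGram` walks a plain word list instead of generating word forms from a Dictionary.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class NGramTraversalSketch {
    /** Hypothetical callback: receives a word and the [start, end) range of one n-gram. */
    interface RangeConsumer {
        void accept(CharSequence word, int start, int end);
    }

    /** Hypothetical stand-in for a dictionary traversal over plain words. */
    public static void forEachNGram(
            int n, List<String> words, Runnable checkCanceled, RangeConsumer consumer) {
        for (String word : words) {
            checkCanceled.run(); // may throw to abort the traversal
            for (int i = 0; i + n <= word.length(); i++) {
                consumer.accept(word, i, i + n); // duplicates are passed through as is
            }
        }
    }

    /** A consumer that tolerates duplicates by collecting n-grams into a set. */
    public static Set<String> distinctNGrams(int n, List<String> words, Runnable checkCanceled) {
        Set<String> seen = new TreeSet<>();
        forEachNGram(n, words, checkCanceled,
                (w, start, end) -> seen.add(w.subSequence(start, end).toString()));
        return seen;
    }

    public static void main(String[] args) {
        // "an" and "na" occur several times but are recorded once.
        System.out.println(distinctNGrams(2, List.of("banana", "bandana"), () -> {}));
        // prints "[an, ba, da, na, nd]"
    }
}
```

Passing a `Runnable` that throws (for example a `java.util.concurrent.CancellationException`) stops this sketch at the next word boundary, which mirrors the role described for `checkCanceled`.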