org.apache.lucene.analysis.hunspell.NGramFragmentChecker

All Implemented Interfaces: FragmentChecker

public class NGramFragmentChecker extends Object implements FragmentChecker

A FragmentChecker based on all character n-grams possible in a certain language, keeping them in a relatively memory-efficient, but probabilistic data structure. The n-gram length should be 2, 3 or 4.
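To make the idea of a memory-efficient but probabilistic n-gram store concrete, here is a minimal sketch. It is not the Lucene implementation: the class name `HashedNGramSet`, the `log2Size` parameter, and the simple polynomial hash are all assumptions chosen for illustration. Each n-gram is hashed into a fixed-size bit table, so a lookup can report an n-gram as possible when it was never seen (a hash collision), but never reports a seen n-gram as missing.

```java
import java.util.BitSet;
import java.util.Collection;

/** Hypothetical illustration only; not the data structure Lucene actually uses. */
final class HashedNGramSet {
  private final int n;
  private final BitSet bits;
  private final int mask;

  /**
   * @param n n-gram length (the class description recommends 2, 3 or 4)
   * @param log2Size number of hash bits, e.g. 16 for a 65536-bit table (arbitrary choice)
   * @param words the strings to extract n-grams from
   */
  HashedNGramSet(int n, int log2Size, Collection<? extends CharSequence> words) {
    if (n < 2 || n > 4) {
      throw new IllegalArgumentException("n should be 2, 3 or 4");
    }
    this.n = n;
    this.mask = (1 << log2Size) - 1;
    this.bits = new BitSet(1 << log2Size);
    for (CharSequence word : words) {
      // Slide a window of length n over the word and mark each n-gram's slot.
      for (int i = 0; i + n <= word.length(); i++) {
        bits.set(slot(word, i));
      }
    }
  }

  /** Hashes the n characters starting at {@code start} into a table slot. */
  private int slot(CharSequence s, int start) {
    int h = 0;
    for (int i = start; i < start + n; i++) {
      h = 31 * h + s.charAt(i);
    }
    return h & mask; // mask is non-negative, so the result is a valid index
  }

  /** False means "definitely never seen"; true means "seen, or a hash collision". */
  boolean mightContain(CharSequence ngram) {
    return ngram.length() == n && bits.get(slot(ngram, 0));
  }
}
```

With this sketch, `new HashedNGramSet(3, 16, List.of("hello", "help")).mightContain("hel")` returns true. A smaller table saves memory at the cost of more collisions, which is the trade-off the word "probabilistic" points at.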

Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static interface | NGramFragmentChecker.NGramConsumer | A callback for n-gram ranges in words |

Field Summary

Fields inherited from interface org.apache.lucene.analysis.hunspell.FragmentChecker
EVERYTHING_POSSIBLE

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static NGramFragmentChecker | fromAllSimpleWords(int n, Dictionary dictionary, Runnable checkCanceled) | Iterate the whole dictionary, derive all word forms (using WordFormGenerator), vary the case to get all words acceptable by the spellchecker, and create a fragment checker based on their n-grams. |
| static NGramFragmentChecker | fromWords(int n, Collection<? extends CharSequence> words) | Create a fragment checker for n-grams found in the given words. |
| boolean | hasImpossibleFragmentAround(CharSequence word, int start, int end) | Check if the given word range intersects any fragment which is impossible in the current language. |
| static void | processNGrams(int n, Dictionary dictionary, Runnable checkCanceled, NGramFragmentChecker.NGramConsumer consumer) | Traverse the whole dictionary, generate all word forms of its entries, and process all n-grams in these word forms. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- hasImpossibleFragmentAround
  
  public boolean hasImpossibleFragmentAround(CharSequence word, int start, int end)
  
  Description copied from interface: FragmentChecker
  
  Check if the given word range intersects any fragment which is impossible in the current language. For example, if the word is "aaax", and there are no "aaa" combinations in words accepted by the spellchecker (but "aax" is valid), then true can be returned for all ranges in 0..3, but not for 3..4.
  The implementation must be monotonic: if some range is considered impossible, larger ranges encompassing it should also produce true.
  
  Specified by:
  
  hasImpossibleFragmentAround in interface FragmentChecker
  
  Parameters:
  
  word - the whole word being checked for impossible substrings
  
  start - the start of the range in question, inclusive
  
  end - the end of the range in question, exclusive (as the 0..3 and 3..4 ranges in the "aaax" example imply), not smaller than start
- fromAllSimpleWords
  
  public static NGramFragmentChecker fromAllSimpleWords(int n, Dictionary dictionary, Runnable checkCanceled)
  
  Iterate the whole dictionary, derive all word forms (using WordFormGenerator), vary the case to get all words acceptable by the spellchecker, and create a fragment checker based on their n-grams. Note that this enumerates only words derivable by suffixes and prefixes. If the language has compounds, some n-grams possible via those compounds can be missed. In that case, consider using fromWords(int, Collection<? extends CharSequence>).
  
  Parameters:
  
  n - the length of n-grams
  
  dictionary - the dictionary to traverse
  
  checkCanceled - a callback that is invoked periodically and can interrupt the traversal by throwing an exception
- fromWords
  
  public static NGramFragmentChecker fromWords(int n, Collection<? extends CharSequence> words)
  
  Create a fragment checker for n-grams found in the given words. The words can be n-grams themselves or full words of the language. The words are case-sensitive, so be sure to include upper-case and title-case variants if they're accepted by the spellchecker.
  
  Parameters:
  
  n - the length of the n-grams to consider
  
  words - the strings to extract n-grams from
- processNGrams
  
  public static void processNGrams(int n, Dictionary dictionary, Runnable checkCanceled, NGramFragmentChecker.NGramConsumer consumer)
  
  Traverse the whole dictionary, generate all word forms of its entries, and process all n-grams in these word forms. No deduplication is done, so the consumer should be prepared to receive duplicate n-grams. The traversal order is undefined.
  
  Parameters:
  
  n - the length of the n-grams
  
  dictionary - the dictionary to traverse
  
  checkCanceled - a callback that is invoked periodically and can interrupt the traversal by throwing an exception
  
  consumer - the n-gram consumer to be called for each n-gram
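The range semantics described for hasImpossibleFragmentAround can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: `RangeCheckSketch` is a hypothetical class, it keeps n-grams in an exact `HashSet` instead of a probabilistic structure, and it treats `end` as exclusive, which is the reading under which the "aaax" example works out. A range counts as impossible when any n-gram window overlapping it is unknown.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the range check; not the Lucene implementation. */
final class RangeCheckSketch {
  private final int n;
  private final Set<String> known = new HashSet<>();

  /** Collects every n-gram of the given words, case-sensitively. */
  RangeCheckSketch(int n, Collection<? extends CharSequence> words) {
    this.n = n;
    for (CharSequence w : words) {
      for (int i = 0; i + n <= w.length(); i++) {
        known.add(w.subSequence(i, i + n).toString());
      }
    }
  }

  /**
   * Returns true if some n-gram window overlapping [start, end) is unknown.
   * Assumption: start is inclusive and end is exclusive.
   */
  boolean hasImpossibleFragmentAround(CharSequence word, int start, int end) {
    // The window starting at i covers [i, i + n); it overlaps the range
    // when i < end and i + n > start.
    int first = Math.max(0, start - n + 1);
    int last = Math.min(end - 1, word.length() - n);
    for (int i = first; i <= last; i++) {
      if (!known.contains(word.subSequence(i, i + n).toString())) {
        return true;
      }
    }
    return false;
  }
}
```

For a checker built with n = 3 from the single word "aax", the word "aaax" yields true for the ranges 0..1, 1..2, 2..3 and 0..3 (each overlaps the unknown "aaa"), and false for 3..4 (only "aax" overlaps it). Widening a range can only add overlapping windows, so the check is monotonic: 0..4 also yields true.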
