Class NGramFragmentChecker

java.lang.Object
org.apache.lucene.analysis.hunspell.NGramFragmentChecker
All Implemented Interfaces:
FragmentChecker

public class NGramFragmentChecker extends Object implements FragmentChecker
A FragmentChecker based on all character n-grams possible in a certain language, kept in a relatively memory-efficient but probabilistic data structure. The n-gram length should be 2, 3 or 4.
  • Method Details

    • hasImpossibleFragmentAround

      public boolean hasImpossibleFragmentAround(CharSequence word, int start, int end)
      Description copied from interface: FragmentChecker
      Check if the given word range intersects any fragment which is impossible in the current language. For example, if the word is "aaax", and there are no "aaa" combinations in words accepted by the spellchecker (but "aax" is valid), then true can be returned for all ranges in 0..3, but not for 3..4.

      The implementation must be monotonic: if some range is considered impossible, larger ranges encompassing it should also produce true.

      Specified by:
      hasImpossibleFragmentAround in interface FragmentChecker
      Parameters:
      word - the whole word being checked for impossible substrings
      start - the start of the range in question, inclusive
      end - the end of the range in question, inclusive, not smaller than start
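The range check and the monotonicity requirement can be illustrated with a minimal sketch. It is not the Lucene implementation: the class name `ExactNGramCheckerSketch` is hypothetical, and it stores n-grams in an exact `HashSet` rather than the probabilistic structure this class uses. Following the parameter description above, `end` is treated as inclusive and clamped to the word length.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical exact (non-probabilistic) stand-in, for illustration only. */
final class ExactNGramCheckerSketch {
  private final int n;
  private final Set<String> known = new HashSet<>();

  ExactNGramCheckerSketch(int n, List<String> words) {
    this.n = n;
    // Remember every n-gram that occurs in an accepted word.
    for (String w : words) {
      for (int i = 0; i + n <= w.length(); i++) {
        known.add(w.substring(i, i + n));
      }
    }
  }

  /** Assumption: 'end' is inclusive and is clamped to the word bounds. */
  boolean hasImpossibleFragmentAround(CharSequence word, int start, int end) {
    if (word.length() < n) {
      return false; // no complete n-gram to inspect
    }
    // An n-gram starting at i covers positions i..i+n-1; visit every i whose
    // n-gram overlaps the range start..end.
    int first = Math.max(0, start - n + 1);
    int last = Math.min(end, word.length() - n);
    for (int i = first; i <= last; i++) {
      if (!known.contains(word.subSequence(i, i + n).toString())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ExactNGramCheckerSketch checker = new ExactNGramCheckerSketch(3, List.of("aax"));
    System.out.println(checker.hasImpossibleFragmentAround("aaax", 0, 3)); // true: overlaps "aaa"
    System.out.println(checker.hasImpossibleFragmentAround("aaax", 3, 4)); // false: only "aax" overlaps
  }
}
```

Widening a range can only add n-gram start positions to the loop, so any range that encloses an impossible one also returns true, which is the monotonicity property described above.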
    • fromAllSimpleWords

      public static NGramFragmentChecker fromAllSimpleWords(int n, Dictionary dictionary, Runnable checkCanceled)
      Iterate over the whole dictionary, derive all word forms (using WordFormGenerator), vary the case to get all words accepted by the spellchecker, and create a fragment checker based on their n-grams. Note that this enumerates only words derivable by suffixes and prefixes. If the language has compounds, some n-grams that are possible only via those compounds can be missed. In that case, consider using fromWords(int, java.util.Collection<? extends java.lang.CharSequence>).
      Parameters:
      n - the length of n-grams
      dictionary - the dictionary to traverse
      checkCanceled - a callback that is invoked periodically and can interrupt the traversal by throwing an exception
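One possible shape for the `checkCanceled` callback is sketched below. The class `CheckCanceledSketch`, the flag-based `cancelWhen` helper and the `traverse` loop are hypothetical stand-ins; the loop only imitates a long traversal that polls the callback and is not the dictionary traversal itself.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

final class CheckCanceledSketch {
  /** Builds a Runnable that throws once the given flag has been set. */
  static Runnable cancelWhen(AtomicBoolean cancelled) {
    return () -> {
      if (cancelled.get()) {
        throw new CancellationException("traversal cancelled");
      }
    };
  }

  /** Hypothetical stand-in for a long traversal that polls the callback per entry. */
  static int traverse(List<String> entries, Runnable checkCanceled) {
    int processed = 0;
    for (String entry : entries) {
      checkCanceled.run(); // throws to abort the loop
      processed++;
    }
    return processed;
  }

  public static void main(String[] args) {
    AtomicBoolean cancelled = new AtomicBoolean(false);
    Runnable check = cancelWhen(cancelled);
    System.out.println(traverse(List.of("a", "b", "c"), check)); // 3
    cancelled.set(true); // e.g. set from another thread when the user gives up
    try {
      traverse(List.of("a", "b", "c"), check);
    } catch (CancellationException e) {
      System.out.println("aborted: " + e.getMessage());
    }
  }
}
```

A Runnable built this way could be passed as the `checkCanceled` argument; the caller is then responsible for catching the exception it throws.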
    • fromWords

      public static NGramFragmentChecker fromWords(int n, Collection<? extends CharSequence> words)
      Create a fragment checker for n-grams found in the given words. The words can be n-grams themselves or full words of the language. The words are case-sensitive, so be sure to include upper-case and title-case variants if they're accepted by the spellchecker.
      Parameters:
      n - the length of the n-grams to consider
      words - the strings to extract n-grams from
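Because the words are case-sensitive, the input collection may need explicit case variants. The sketch below shows how the set of n-grams grows when title-case and upper-case forms are added. The class `NGramExtractionSketch` and its helpers are hypothetical, and whether such variants are actually accepted by the spellchecker depends on the dictionary.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class NGramExtractionSketch {
  /** Returns the word plus its title-case and upper-case forms (assumed to be accepted). */
  static List<String> withCaseVariants(String word) {
    if (word.isEmpty()) {
      return List.of(word);
    }
    String title = word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
    return List.of(word, title, word.toUpperCase(Locale.ROOT));
  }

  /** Collects the distinct n-grams of all words; words shorter than n contribute nothing. */
  static Set<String> nGrams(int n, Collection<? extends CharSequence> words) {
    Set<String> result = new TreeSet<>();
    for (CharSequence w : words) {
      for (int i = 0; i + n <= w.length(); i++) {
        result.add(w.subSequence(i, i + n).toString());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(nGrams(2, List.of("cat")));            // [at, ca]
    System.out.println(nGrams(2, withCaseVariants("cat")));   // [AT, CA, Ca, at, ca]
  }
}
```

A collection prepared like the output of `withCaseVariants` is the kind of input that could be passed as the `words` argument.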
    • processNGrams

      public static void processNGrams(int n, Dictionary dictionary, Runnable checkCanceled, NGramFragmentChecker.NGramConsumer consumer)
      Traverse the whole dictionary, generate all word forms of its entries, and process all n-grams in these word forms. No duplicate removal is done, so the consumer should be prepared to receive duplicate n-grams. The traversal order is undefined.
      Parameters:
      n - the length of the n-grams
      dictionary - the dictionary to traverse
      checkCanceled - a callback that is invoked periodically and can interrupt the traversal by throwing an exception
      consumer - the n-gram consumer to be called for each n-gram
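A consumer that tolerates duplicates and an undefined order can simply aggregate what it receives. The sketch below counts occurrences per n-gram. The `RangeConsumer` interface is a hypothetical callback shape with an exclusive `end`, and `emitNGrams` is a stand-in for the traversal; the actual NGramFragmentChecker.NGramConsumer signature may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateTolerantConsumerSketch {
  /** Hypothetical callback shape; 'end' is exclusive here by assumption. */
  interface RangeConsumer {
    void accept(CharSequence word, int start, int end);
  }

  /** Stand-in for the traversal: emits every n-gram range, duplicates included. */
  static void emitNGrams(int n, List<String> words, RangeConsumer consumer) {
    for (String w : words) {
      for (int i = 0; i + n <= w.length(); i++) {
        consumer.accept(w, i, i + n);
      }
    }
  }

  /** Aggregates n-grams into counts, so repeated deliveries are harmless. */
  static Map<String, Integer> countNGrams(int n, List<String> words) {
    Map<String, Integer> counts = new HashMap<>();
    emitNGrams(n, words, (word, start, end) ->
        counts.merge(word.subSequence(start, end).toString(), 1, Integer::sum));
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = countNGrams(2, List.of("abab", "ab"));
    System.out.println(counts.get("ab")); // 3
    System.out.println(counts.get("ba")); // 1
  }
}
```

Keying the map by n-gram makes the result independent of delivery order, which matters because the traversal order is undefined.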