org.apache.lucene.search.suggest.analyzing.FreeTextSuggester

All Implemented Interfaces: Accountable

public class FreeTextSuggester extends Lookup

Builds an ngram model from the text sent to build(org.apache.lucene.search.suggest.InputIterator) and predicts based on the last grams-1 tokens in the request sent to lookup(java.lang.CharSequence, boolean, int). This tries to handle the "long tail" of suggestions, for cases where the incoming query is a never-before-seen query string.
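As a rough illustration of "the last grams-1 tokens", the sketch below extracts that context window from a request string. It is not Lucene code: the class and method names are hypothetical, and whitespace splitting stands in for whatever the configured analyzer would produce.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextWindowSketch {
    // Returns the last (grams - 1) tokens of the request; these form the
    // context that an ngram model of order `grams` would predict from.
    static List<String> context(String request, int grams) {
        if (grams < 1) {
            throw new IllegalArgumentException("grams must be >= 1");
        }
        // Assumption: plain whitespace tokenization instead of a real Analyzer.
        List<String> tokens = new ArrayList<>();
        for (String token : request.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        int keep = Math.min(grams - 1, tokens.size());
        return new ArrayList<>(tokens.subList(tokens.size() - keep, tokens.size()));
    }
}
```

With a trigram model (grams = 3), a request such as "the quick brown fox" would be reduced to the two-token context "brown fox" under these assumptions.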

Likely this suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.

Note that the weight for each suggestion is unused, and the suggestions are the analyzed forms (so your analysis process should normally be very "light").

This uses the stupid backoff language model to smooth scores across ngram models; see "Large language models in machine translation", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126 for details.
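The backoff idea can be sketched in a few lines. This is a simplified illustration, not the suggester's implementation: the in-memory count map, the space-joined ngram keys, and the class and method names are all assumptions made for the example (the real model lives in an FST). Only the 0.4 factor is taken from the ALPHA field documented below.

```java
import java.util.HashMap;
import java.util.Map;

final class StupidBackoffSketch {
    static final double ALPHA = 0.4;

    // Hypothetical count table: ngram (tokens joined by one space) -> occurrences.
    private final Map<String, Long> counts = new HashMap<>();
    private long totalUnigrams = 0;

    void add(String ngram, long count) {
        counts.merge(ngram, count, Long::sum);
        if (ngram.indexOf(' ') < 0) {
            totalUnigrams += count;
        }
    }

    // Scores `word` given `context` (zero or more space-joined tokens).
    double score(String context, String word) {
        double backoff = 1.0;
        String ctx = context.trim();
        while (!ctx.isEmpty()) {
            long joint = counts.getOrDefault(ctx + " " + word, 0L);
            long ctxCount = counts.getOrDefault(ctx, 0L);
            if (joint > 0 && ctxCount > 0) {
                // The ngram was seen: relative frequency, scaled by any backoff so far.
                return backoff * joint / ctxCount;
            }
            // Unseen: drop the leftmost context token and pay the ALPHA penalty.
            int space = ctx.indexOf(' ');
            ctx = space < 0 ? "" : ctx.substring(space + 1);
            backoff *= ALPHA;
        }
        long unigram = counts.getOrDefault(word, 0L);
        return totalUnigrams == 0 ? 0.0 : backoff * unigram / totalUnigrams;
    }
}
```

For instance, if "new york city" was never seen but "york city" occurred once and "york" twice, the score for "city" after "new york" is 0.4 * 1/2 = 0.2 in this sketch. Unlike a true probability model, the scores are not normalized, which is what keeps the scheme cheap.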

From lookup(java.lang.CharSequence, boolean, int), the key of each result is the ngram token; the value is Long.MAX_VALUE * score (fixed point, cast to long). Divide by Long.MAX_VALUE to get the score back, which ranges from 0.0 to 1.0.
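The fixed-point conversion described above is plain arithmetic and can be sketched without any Lucene types. The class and method names here are hypothetical; in practice the long would come from the value of a lookup result.

```java
final class FixedPointScore {
    // Encodes a score in [0.0, 1.0] as a long, mirroring "Long.MAX_VALUE * score, cast to long".
    static long encode(double score) {
        return (long) (Long.MAX_VALUE * score);
    }

    // Recovers the floating-point score from the fixed-point value.
    static double decode(long value) {
        return (double) value / Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        long stored = encode(0.5);
        System.out.println(stored + " -> " + decode(stored));
    }
}
```

Note that the division must be done in floating point; dividing the long by Long.MAX_VALUE with integer arithmetic would truncate almost every score to 0.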

onlyMorePopular is unused.

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.search.suggest.Lookup
Lookup.LookupPriorityQueue, Lookup.LookupResult

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final double | ALPHA | The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned. |
| static final String | CODEC_NAME | Codec name used in the header for the saved model. |
| static final int | DEFAULT_GRAMS | By default we use a bigram model. |
| static final byte | DEFAULT_SEPARATOR | The default character used to join multiple tokens into a single ngram token. |
| static final int | VERSION_CURRENT | Current version of the saved model file format. |
| static final int | VERSION_START | Initial version of the saved model file format. |

Fields inherited from class org.apache.lucene.search.suggest.Lookup
CHARSEQUENCE_COMPARATOR

Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| FreeTextSuggester(Analyzer analyzer) | Instantiate, using the provided analyzer for both indexing and lookup, using a bigram model by default. |
| FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer) | Instantiate, using the provided indexing and lookup analyzers, using a bigram model by default. |
| FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams) | Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.). |
| FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator) | Instantiate, using the provided indexing and lookup analyzers, and the specified model (2 = bigram, 3 = trigram, etc.). |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | build(InputIterator iterator) | Builds up a new internal Lookup representation based on the given InputIterator. |
| void | build(InputIterator iterator, double ramBufferSizeMB) | Build the suggest index, using up to the specified amount of temporary RAM while building. |
| Object | get(CharSequence key) | Returns the weight associated with an input string, or null if it does not exist. |
| Collection<Accountable> | getChildResources() | |
| long | getCount() | Get the number of entries the lookup was built with. |
| boolean | load(DataInput input) | Discard current lookup data and load it from a previously saved copy. |
| List<Lookup.LookupResult> | lookup(CharSequence key, boolean onlyMorePopular, int num) | Look up a key and return possible completions for this key. |
| List<Lookup.LookupResult> | lookup(CharSequence key, int num) | Lookup, without any context. |
| List<Lookup.LookupResult> | lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num) | Look up a key and return possible completions for this key. |
| List<Lookup.LookupResult> | lookup(CharSequence key, Set<BytesRef> contexts, int num) | Retrieve suggestions. |
| long | ramBytesUsed() | Returns byte size of the underlying FST. |
| boolean | store(DataOutput output) | Persist the constructed lookup data to a directory. |

Methods inherited from class org.apache.lucene.search.suggest.Lookup
build, load, lookup, store

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- CODEC_NAME
  
  public static final String CODEC_NAME
  
  Codec name used in the header for the saved model.
  See Also:
  
  Constant Field Values
- VERSION_START
  
  public static final int VERSION_START
  
  Initial version of the saved model file format.
  See Also:
  
  Constant Field Values
- VERSION_CURRENT
  
  public static final int VERSION_CURRENT
  
  Current version of the saved model file format.
  See Also:
  
  Constant Field Values
- DEFAULT_GRAMS
  
  public static final int DEFAULT_GRAMS
  
  By default we use a bigram model.
  See Also:
  
  Constant Field Values
- ALPHA
  
  public static final double ALPHA
  
  The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
  See Also:
  
  Constant Field Values
- DEFAULT_SEPARATOR
  
  public static final byte DEFAULT_SEPARATOR
  
  The default character used to join multiple tokens into a single ngram token. The input tokens produced by the analyzer must not contain this character.
  See Also:
  
  Constant Field Values

Constructor Details
- FreeTextSuggester
  
  public FreeTextSuggester(Analyzer analyzer)
  
  Instantiate, using the provided analyzer for both indexing and lookup, using a bigram model by default.
- FreeTextSuggester
  
  public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
  
  Instantiate, using the provided indexing and lookup analyzers, using a bigram model by default.
- FreeTextSuggester
  
  public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
  
  Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
- FreeTextSuggester
  
  public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
  
  Instantiate, using the provided indexing and lookup analyzers, and the specified model (2 = bigram, 3 = trigram, etc.). The separator is passed to ShingleFilter.setTokenSeparator(java.lang.String) to join multiple tokens into a single ngram token; it must be an ASCII (7-bit-clean) byte. No input tokens should contain this byte; otherwise IllegalArgumentException is thrown.
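The separator constraints can be illustrated with a small standalone sketch. It is not the suggester's code: the class and method names are hypothetical, the tokens are plain strings rather than analyzer output, and 0x1e in the usage note is just an illustrative control character.

```java
import java.util.List;

final class SeparatorSketch {
    // Joins tokens into a single ngram token, enforcing the two rules stated above:
    // the separator is 7-bit ASCII, and no token may already contain it.
    static String join(List<String> tokens, byte separator) {
        // Java bytes are signed, so any value above 0x7F shows up as negative.
        if (separator < 0) {
            throw new IllegalArgumentException("separator must be a 7-bit ASCII byte");
        }
        char sep = (char) separator;
        for (String token : tokens) {
            if (token.indexOf(sep) >= 0) {
                throw new IllegalArgumentException("token contains the separator: " + token);
            }
        }
        return String.join(String.valueOf(sep), tokens);
    }
}
```

Calling `SeparatorSketch.join(Arrays.asList("new", "york"), (byte) 0x1e)` yields the two tokens joined by that control character. Choosing a byte that the analyzer can never emit keeps ngram boundaries unambiguous.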
Method Details
- ramBytesUsed
  
  public long ramBytesUsed()
  
  Returns byte size of the underlying FST.
- getChildResources
  
  public Collection<Accountable> getChildResources()
- build
  
  public void build(InputIterator iterator) throws IOException
  
  Description copied from class: Lookup
  
  Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
  
  Specified by:
  
  build in class Lookup
  
  Throws:
  
  IOException
- build
  
  public void build(InputIterator iterator, double ramBufferSizeMB) throws IOException
  
  Build the suggest index, using up to the specified amount of temporary RAM while building. Note that the weights for the suggestions are ignored.
  
  Throws:
  
  IOException
- store
  
  public boolean store(DataOutput output) throws IOException
  
  Description copied from class: Lookup
  
  Persist the constructed lookup data to a directory. Optional operation.
  
  Specified by:
  
  store in class Lookup
  
  Parameters:
  
  output - DataOutput to write the data to.
  
  Returns:
  
  true if successful, false if unsuccessful or not supported.
  
  Throws:
  
  IOException - when fatal IO error occurs.
- load
  
  public boolean load(DataInput input) throws IOException
  
  Description copied from class: Lookup
  
  Discard current lookup data and load it from a previously saved copy. Optional operation.
  
  Specified by:
  
  load in class Lookup
  
  Parameters:
  
  input - the DataInput to load the lookup data.
  
  Returns:
  
  true if completed successfully, false if unsuccessful or not supported.
  
  Throws:
  
  IOException - when fatal IO error occurs.
- lookup
  
  public List<Lookup.LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num)
  
  Description copied from class: Lookup
  
  Look up a key and return possible completions for this key.
  
  Overrides:
  
  lookup in class Lookup
  
  Parameters:
  
  key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
  
  onlyMorePopular - return only more popular results
  
  num - maximum number of results to return
  
  Returns:
  
  a list of possible completions, with their relative weight (e.g. popularity)
- lookup
  
  public List<Lookup.LookupResult> lookup(CharSequence key, int num)
  
  Lookup, without any context.
- lookup
  
  public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num)
  
  Description copied from class: Lookup
  
  Look up a key and return possible completions for this key.
  
  Specified by:
  
  lookup in class Lookup
  
  Parameters:
  
  key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
  
  contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
  
  onlyMorePopular - return only more popular results
  
  num - maximum number of results to return
  
  Returns:
  
  a list of possible completions, with their relative weight (e.g. popularity)
- getCount
  
  public long getCount()
  
  Description copied from class: Lookup
  
  Get the number of entries the lookup was built with.
  
  Specified by:
  
  getCount in class Lookup
  
  Returns:
  
  total number of suggester entries
- lookup
  
  public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, int num) throws IOException
  
  Retrieve suggestions.
  
  Throws:
  
  IOException
- get
  
  public Object get(CharSequence key)
  
  Returns the weight associated with an input string, or null if it does not exist.
