public class FreeTextSuggester extends Lookup
Builds an ngram model from the text sent to build(org.apache.lucene.search.suggest.InputIterator) and predicts based on the last grams-1 tokens in the request sent to lookup(java.lang.CharSequence, boolean, int). This tries to handle the "long tail" of suggestions for when the incoming query is a never-before-seen query string.
Likely this suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.
Note that the weight for each suggestion is unused, and the suggestions are the analyzed forms (so your analysis process should normally be very "light").
This uses the stupid backoff language model to smooth scores across ngram models; see "Large language models in machine translation", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126 for details.
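As a rough illustration of the backoff scoring described above, the following self-contained sketch counts ngrams in memory and scores a candidate token against the last grams-1 context tokens, multiplying by 0.4 each time it falls back to a shorter context. The class StupidBackoffSketch and its methods are hypothetical names chosen for this example; this is not Lucene's implementation, and it assumes tokens contain no spaces.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stupid-backoff scorer (hypothetical; not the Lucene implementation). */
class StupidBackoffSketch {
    /** Backoff factor, matching the documented ALPHA value of 0.4. */
    static final double ALPHA = 0.4;

    private final Map<String, Long> counts = new HashMap<>();
    private final int grams;
    private long totalTokens = 0;

    StupidBackoffSketch(int grams) {
        this.grams = grams;
    }

    /** Counts every ngram of length 1..grams in one tokenized input. */
    void add(List<String> tokens) {
        totalTokens += tokens.size();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= grams && start + len <= tokens.size(); len++) {
                String ngram = String.join(" ", tokens.subList(start, start + len));
                counts.merge(ngram, 1L, Long::sum);
            }
        }
    }

    /** Scores candidate as the next token, backing off to shorter contexts when needed. */
    double score(List<String> context, String candidate) {
        // Only the last grams-1 tokens of the request are used as context.
        int from = Math.max(0, context.size() - (grams - 1));
        List<String> ctx = context.subList(from, context.size());
        double backoff = 1.0;
        while (!ctx.isEmpty()) {
            String prefix = String.join(" ", ctx);
            long full = counts.getOrDefault(prefix + " " + candidate, 0L);
            if (full > 0) {
                // Relative frequency of the full ngram given its context.
                return backoff * full / counts.get(prefix);
            }
            // Unseen at this order: drop the oldest context token and apply ALPHA.
            ctx = ctx.subList(1, ctx.size());
            backoff *= ALPHA;
        }
        // Unigram fallback: relative frequency of the candidate among all tokens.
        long unigram = counts.getOrDefault(candidate, 0L);
        return totalTokens == 0 ? 0.0 : backoff * unigram / totalTokens;
    }

    public static void main(String[] args) {
        StupidBackoffSketch model = new StupidBackoffSketch(3);
        model.add(Arrays.asList("a", "b", "c"));
        model.add(Arrays.asList("b", "d"));
        System.out.println(model.score(Arrays.asList("a", "b"), "c"));
        System.out.println(model.score(Arrays.asList("x", "b"), "c"));
    }
}
```

In this sketch a candidate seen with its full context keeps its raw relative frequency, while each backoff step scales the lower-order score by ALPHA, so the results are scores rather than normalized probabilities.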
From lookup(java.lang.CharSequence, boolean, int), the key of each result is the ngram token; the value is Long.MAX_VALUE * score (fixed point, cast to long). Divide by Long.MAX_VALUE to get the score back, which ranges from 0.0 to 1.0.
onlyMorePopular is unused.
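The fixed-point convention for result values can be sketched with plain arithmetic. The class and method names below (FixedPointScore, encode, decode) are hypothetical helpers for illustration and are not part of the Lucene API; decode would be applied to the value of each returned result.

```java
/** Hypothetical helper showing the fixed-point score convention described above. */
class FixedPointScore {
    /** Encodes a score in [0.0, 1.0] as Long.MAX_VALUE * score, cast to long. */
    static long encode(double score) {
        return (long) (Long.MAX_VALUE * score);
    }

    /** Recovers the score by dividing the stored value by Long.MAX_VALUE. */
    static double decode(long value) {
        return (double) value / Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        long stored = encode(0.2);
        System.out.println(stored + " decodes to " + decode(stored));
    }
}
```

Because a double carries fewer significant bits than a long, the round trip is approximate rather than exact, so compare decoded scores with a small tolerance.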
Nested classes/interfaces inherited from class Lookup: Lookup.LookupPriorityQueue, Lookup.LookupResult
Modifier and Type | Field and Description |
---|---|
static double | ALPHA: The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur, and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned. |
static String | CODEC_NAME: Codec name used in the header for the saved model. |
static int | DEFAULT_GRAMS: By default we use a bigram model. |
static byte | DEFAULT_SEPARATOR: The default character used to join multiple tokens into a single ngram token. |
static int | VERSION_CURRENT: Current version of the saved model file format. |
static int | VERSION_START: Initial version of the saved model file format. |
Fields inherited from class Lookup: CHARSEQUENCE_COMPARATOR
Constructor and Description |
---|
FreeTextSuggester(Analyzer analyzer): Instantiate, using the provided analyzer for both indexing and lookup, using a bigram model by default. |
FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer): Instantiate, using the provided indexing and lookup analyzers, using a bigram model by default. |
FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams): Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.). |
FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator): Instantiate, using the provided indexing and lookup analyzers, and the specified model (2 = bigram, 3 = trigram, etc.). |
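The separator byte accepted by the four-argument constructor joins multiple tokens into a single ngram token, and input tokens must not contain it. The sketch below illustrates those two constraints in plain Java. SeparatorSketch, isSevenBitClean, and joinTokens are hypothetical names, and the separator value used in main is an arbitrary ASCII control byte chosen for the example, not a value taken from this page.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the separator constraints; not Lucene code. */
class SeparatorSketch {
    /** A byte is 7-bit clean (ASCII) when its high bit is not set. */
    static boolean isSevenBitClean(byte separator) {
        return (separator & 0x80) == 0;
    }

    /** Joins tokens into one ngram token, rejecting tokens that contain the separator. */
    static String joinTokens(List<String> tokens, byte separator) {
        if (!isSevenBitClean(separator)) {
            throw new IllegalArgumentException("separator must be a 7-bit-clean byte");
        }
        for (String token : tokens) {
            // In UTF-8, bytes below 0x80 only ever encode ASCII characters,
            // so a byte-level check is sufficient here.
            for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
                if (b == separator) {
                    throw new IllegalArgumentException("token contains the separator byte: " + token);
                }
            }
        }
        return String.join(String.valueOf((char) separator), tokens);
    }

    public static void main(String[] args) {
        byte separator = 0x1e; // assumption: an arbitrary ASCII control byte for this example
        String ngram = joinTokens(Arrays.asList("new", "york"), separator);
        System.out.println(ngram.length());
    }
}
```

Choosing a control character that light analysis is unlikely to emit keeps ordinary text from colliding with the separator.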
Modifier and Type | Method and Description |
---|---|
void | build(InputIterator iterator): Builds up a new internal Lookup representation based on the given InputIterator. |
void | build(InputIterator iterator, double ramBufferSizeMB): Build the suggest index, using up to the specified amount of temporary RAM while building. |
Object | get(CharSequence key): Returns the weight associated with an input string, or null if it does not exist. |
Collection<Accountable> | getChildResources(): Returns nested resources of this class. |
long | getCount(): Get the number of entries the lookup was built with. |
boolean | load(DataInput input): Discard current lookup data and load it from a previously saved copy. |
List<Lookup.LookupResult> | lookup(CharSequence key, boolean onlyMorePopular, int num): Look up a key and return possible completions for this key. |
List<Lookup.LookupResult> | lookup(CharSequence key, int num): Lookup, without any context. |
List<Lookup.LookupResult> | lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num): Look up a key and return possible completions for this key. |
List<Lookup.LookupResult> | lookup(CharSequence key, Set<BytesRef> contexts, int num): Retrieve suggestions. |
long | ramBytesUsed(): Returns byte size of the underlying FST. |
boolean | store(DataOutput output): Persist the constructed lookup data to a directory. |
public static final String CODEC_NAME
public static final int VERSION_START
public static final int VERSION_CURRENT
public static final int DEFAULT_GRAMS
public static final double ALPHA
public static final byte DEFAULT_SEPARATOR
public FreeTextSuggester(Analyzer analyzer)
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
public FreeTextSuggester(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
The separator is passed to ShingleFilter.setTokenSeparator(java.lang.String) to join multiple tokens into a single ngram token; it must be an ASCII (7-bit-clean) byte. No input tokens should have this byte, otherwise IllegalArgumentException is thrown.
public long ramBytesUsed()
public Collection<Accountable> getChildResources()
Description copied from class: Lookup
Returns nested resources of this class.
getChildResources in interface Accountable
getChildResources in class Lookup
See Also: Accountables
public void build(InputIterator iterator) throws IOException
Description copied from class: Lookup
Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
build in class Lookup
Throws:
IOException
public void build(InputIterator iterator, double ramBufferSizeMB) throws IOException
Throws:
IOException
public boolean store(DataOutput output) throws IOException
Description copied from class: Lookup
Persist the constructed lookup data to a directory.
store in class Lookup
Parameters:
output - DataOutput to write the data to.
Throws:
IOException - when fatal IO error occurs.
public boolean load(DataInput input) throws IOException
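Description copied from class: Lookup
Discard current lookup data and load it from a previously saved copy.
load in class Lookup
Parameters:
input - the DataInput to load the lookup data.
Throws:
IOException - when fatal IO error occurs.
public List<Lookup.LookupResult> lookup(CharSequence key, boolean onlyMorePopular, int num)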
Description copied from class: Lookup
Look up a key and return possible completions for this key.
lookup in class Lookup
Parameters:
key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
onlyMorePopular - return only more popular results
num - maximum number of results to return
public List<Lookup.LookupResult> lookup(CharSequence key, int num)
public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Description copied from class: Lookup
Look up a key and return possible completions for this key.
lookup in class Lookup
Parameters:
key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
onlyMorePopular - return only more popular results
num - maximum number of results to return
public long getCount()
Description copied from class: Lookup
Get the number of entries the lookup was built with.
public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, int num) throws IOException
Throws:
IOException
public Object get(CharSequence key)
Copyright © 2000-2015 Apache Software Foundation. All Rights Reserved.