public class CarmelUniformTermPruningPolicy extends TermPruningPolicy
TermPositions whose in-document frequency is below a specified
threshold are pruned. See CarmelTopKTermPruningPolicy for a link to the paper
describing this policy.
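As a rough illustration of the frequency test described here, the sketch below treats the number of positions a term has in a document as its in-document frequency and compares it with a threshold. The class `FrequencyThresholdSketch` and the method `belowThreshold` are hypothetical names chosen for this example; they are not part of the pruning API, and the actual policy may decide differently.

```java
// Hypothetical, self-contained sketch; not the code of CarmelUniformTermPruningPolicy.
final class FrequencyThresholdSketch {

    /**
     * Returns true when the term's in-document frequency (assumed here to be
     * the number of positions) is strictly below the threshold, meaning the
     * positions would be candidates for pruning.
     */
    public static boolean belowThreshold(int[] positions, int threshold) {
        return positions.length < threshold;
    }

    public static void main(String[] args) {
        int[] positions = {3, 17};          // term occurs twice in the document
        System.out.println(belowThreshold(positions, 3));
        System.out.println(belowThreshold(positions, 2));
    }
}
```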
The paper concludes that it is best to compute per-term
thresholds, as is done in
CarmelTopKTermPruningPolicy. However, for
large indexes with a large number of terms that method might be too slow, and
the (enhanced) uniform approach implemented here may be faster, although
it might produce inferior search quality.
This implementation enhances the Carmel uniform pruning approach by allowing three levels of thresholds to be specified: a default threshold, a per-field threshold, and a per-term threshold.
The most specific threshold always takes precedence: a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
Thresholds are maintained in a map, keyed by either field names or terms in
field:text format. When both kinds of key match, the term key takes precedence over the field key.
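The lookup order for thresholds (per-term, then per-field, then default) can be sketched as follows. This is a minimal illustration rather than the class's actual code: the class `ThresholdLookupSketch`, the method `resolveThreshold`, and the choice of `Float` values are assumptions made for the example. Only the `field:text` key format and the order of precedence come from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, self-contained sketch of the threshold lookup order.
final class ThresholdLookupSketch {

    /**
     * Resolves the threshold for a term, most specific key first:
     * "field:text", then "field", then the default.
     */
    public static float resolveThreshold(Map<String, Float> thresholds,
                                         float defaultThreshold,
                                         String field, String text) {
        Float perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Float perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defaultThreshold;
    }

    public static void main(String[] args) {
        Map<String, Float> thresholds = new HashMap<>();
        thresholds.put("body", 2.0f);          // per-field threshold
        thresholds.put("body:lucene", 5.0f);   // per-term threshold

        System.out.println(resolveThreshold(thresholds, 1.0f, "body", "lucene"));
        System.out.println(resolveThreshold(thresholds, 1.0f, "body", "index"));
        System.out.println(resolveThreshold(thresholds, 1.0f, "title", "lucene"));
    }
}
```

With this map, the term `body:lucene` resolves to its own entry, any other term in `body` falls back to the field entry, and terms in unlisted fields use the default.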
Thresholds in this method of pruning are expressed as the percentage of the
top-N scoring documents per term that are retained. The list of top-N
documents is established by running a simple term query for each term, scored
with a regular Similarity.
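One possible reading of a percentage-based threshold is sketched below: the scored documents for a single term are sorted by descending score and only the leading fraction is kept. Everything here is an assumption for illustration, including the class `TopFractionSketch`, the `ScoredDoc` type, the method `retainTopFraction`, expressing the percentage as a fraction between 0 and 1, and rounding up. The real policy obtains scores from a term query and may define the cut-off differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, self-contained sketch; no Lucene classes are used.
final class TopFractionSketch {

    /** Hypothetical value type: a document number and its score for one term. */
    public static final class ScoredDoc {
        public final int doc;
        public final float score;

        public ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Returns the highest-scoring documents, keeping ceil(fraction * size)
     * of them. The input list is not modified.
     */
    public static List<ScoredDoc> retainTopFraction(List<ScoredDoc> hits, float fraction) {
        List<ScoredDoc> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        int keep = (int) Math.ceil(fraction * sorted.size());
        keep = Math.max(0, Math.min(keep, sorted.size()));
        return new ArrayList<>(sorted.subList(0, keep));
    }

    public static void main(String[] args) {
        List<ScoredDoc> hits = new ArrayList<>();
        hits.add(new ScoredDoc(1, 0.2f));
        hits.add(new ScoredDoc(2, 0.9f));
        hits.add(new ScoredDoc(3, 1.5f));
        hits.add(new ScoredDoc(4, 0.4f));

        // Keep the top half of the scored documents for this term.
        for (ScoredDoc d : retainTopFraction(hits, 0.5f)) {
            System.out.println(d.doc + " " + d.score);
        }
    }
}
```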
A smaller threshold value produces a smaller index. See
TermPruningPolicy for size vs. performance considerations.
For indexes with a large number of terms this policy might still be too slow,
since it issues a term query for each term in the index. In such situations,
the term frequency pruning approach in
TFTermPruningPolicy will be
faster, though it might produce inferior search quality.
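|Modifier and Type||Method and Description|
|void||initPositionsTerm(TermPositions tp, Term t): Called when moving TermPositions to a new Term.|
|boolean||pruneAllPositions(TermPositions termPositions, Term t): Prune all postings per term (invoked once per term per doc).|
|int||pruneSomePositions(int docNum, int[] positions, Term curTerm): Prune some postings per term (invoked once per term per doc).|
|boolean||pruneTermEnum(TermEnum te): Pruning of all postings for a term (invoked once per term).|
|int||pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv): Pruning of individual terms in term vectors.|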
Methods inherited from class TermPruningPolicy: pruneAllFieldPostings, prunePayload, pruneWholeTermVector
public boolean pruneTermEnum(org.apache.lucene.index.TermEnum te) throws IOException
public void initPositionsTerm(org.apache.lucene.index.TermPositions tp, org.apache.lucene.index.Term t) throws IOException
Called when moving TermPositions to a new Term.
public boolean pruneAllPositions(org.apache.lucene.index.TermPositions termPositions, org.apache.lucene.index.Term t) throws IOException
termPositions - positioned term positions. Implementations MUST NOT advance this by calling
TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, org.apache.lucene.index.TermFreqVector tfv) throws IOException
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
tfv - the original term frequency vector
public int pruneSomePositions(int docNum, int[] positions, org.apache.lucene.index.Term curTerm)
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term