public abstract class TermPruningPolicy extends PruningPolicy
pruneAllFieldPostings(String)
pruneTermEnum(TermEnum)
For many types of queries, the pruned, smaller index returns nearly identical top-N results to the original index, but with better performance.
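To make this concrete, here is a minimal, self-contained sketch that does not use Lucene at all. The `Posting` class, the `prune` and `topN` helpers, and the rule of keeping the k most frequent postings per term are all assumptions made for illustration; they are not the criterion used by any particular policy in this package.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopKPruningSketch {
    /** Hypothetical posting: a document id plus the term's frequency in it. */
    static final class Posting {
        final int docId;
        final int freq;
        Posting(int docId, int freq) { this.docId = docId; this.freq = freq; }
    }

    /** Assumed pruning rule: keep only the k most frequent postings of a term. */
    static List<Posting> prune(List<Posting> postings, int k) {
        List<Posting> sorted = new ArrayList<>(postings);
        // Highest frequency first; the sort is stable, so ties keep input order.
        sorted.sort(Comparator.comparingInt((Posting p) -> p.freq).reversed());
        int keep = Math.max(0, Math.min(k, sorted.size()));
        return new ArrayList<>(sorted.subList(0, keep));
    }

    /** Toy term query: rank documents by raw frequency, return the n best ids. */
    static List<Integer> topN(List<Posting> postings, int n) {
        List<Integer> ids = new ArrayList<>();
        for (Posting p : prune(postings, n)) {
            ids.add(p.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Posting> original = new ArrayList<>();
        original.add(new Posting(0, 1));
        original.add(new Posting(1, 5));
        original.add(new Posting(2, 3));
        original.add(new Posting(3, 2));
        original.add(new Posting(4, 7));

        // Keep 3 of 5 postings, then compare the top-2 results of both lists.
        List<Posting> pruned = prune(original, 3);
        System.out.println(topN(original, 2).equals(topN(pruned, 2)));
    }
}
```

In this toy setting the top-2 lists agree because the discarded postings were never candidates for the top of the ranking. Real scoring is more involved, which is why the results are described as nearly identical rather than identical.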
Pruning of indexes is handy for producing small first-tier indexes that fit
completely in RAM; these indexes can be stored using IndexWriter.addIndexes(IndexReader...).
Interestingly, if the input index is optimized (i.e. doesn't contain deletions),
then the index produced via IndexWriter.addIndexes(IndexReader[])
will preserve internal document ids, so that they stay in sync with the
original index. This means that all other auxiliary information not necessary
for first-tier processing, such as some stored fields, can also be removed and
then quickly retrieved on demand from the original index using the same
internal document id. See StorePruningPolicy for information about removing
stored fields.
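The shared-id arrangement can be pictured with a small sketch that deliberately avoids the Lucene API. The class `TwoTierLookupSketch` and its methods are hypothetical stand-ins: an array plays the role of the original index's stored fields, and a map plays the role of the pruned first-tier postings. The only point is that a document id found in the first tier is used unchanged to fetch data from the original.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoTierLookupSketch {
    // Stand-in for the original index: the array position is the internal doc id.
    private final String[] originalStoredFields;
    // Stand-in for the pruned first tier: term -> internal doc ids, no stored fields.
    private final Map<String, List<Integer>> firstTierPostings = new HashMap<>();

    TwoTierLookupSketch(String[] originalStoredFields) {
        this.originalStoredFields = originalStoredFields.clone();
    }

    /** Registers a posting in the first tier using the original's doc id. */
    void addFirstTierPosting(String term, int docId) {
        firstTierPostings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }

    /** Matches in the first tier, then fetches stored data from the original by the same id. */
    List<String> search(String term) {
        List<String> results = new ArrayList<>();
        List<Integer> hits = firstTierPostings.getOrDefault(term, Collections.<Integer>emptyList());
        for (int docId : hits) {
            results.add(originalStoredFields[docId]);
        }
        return results;
    }

    public static void main(String[] args) {
        TwoTierLookupSketch tiers =
            new TwoTierLookupSketch(new String[] {"title A", "title B", "title C"});
        tiers.addFirstTierPosting("lucene", 0);
        tiers.addFirstTierPosting("lucene", 2);
        System.out.println(tiers.search("lucene"));
    }
}
```

If the ids were renumbered when the pruned index was written, this lookup would need an extra id-mapping table; keeping the ids in sync is what makes the direct array access valid.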
Please note that while this family of policies produces good results for term queries, it often leads to poor results for phrase queries (because postings are removed without considering whether they belong to an important phrase).
Aggressive pruning policies produce smaller indexes: search performance increases, but recall decreases (i.e. search quality deteriorates).
See the following papers for a discussion of this problem and the
proposed solutions to improve the quality of a pruned index (not implemented
here):
Modifier and Type | Field and Description |
---|---|
protected Map<String,Integer> | fieldFlags: Pruning operations to be conducted on fields. |
protected IndexReader | in |

Fields inherited from PruningPolicy: DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR

Modifier | Constructor and Description |
---|---|
protected | TermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags): Construct a policy. |

Modifier and Type | Method and Description |
---|---|
abstract void | initPositionsTerm(TermPositions in, Term t): Called when moving TermPositions to a new Term. |
boolean | pruneAllFieldPostings(String field): Pruning of all postings for a field. |
abstract boolean | pruneAllPositions(TermPositions termPositions, Term t): Prune all postings per term (invoked once per term per doc). |
boolean | prunePayload(TermPositions in, Term curTerm): Called when checking for the presence of payload for the current term at a current position. |
abstract int | pruneSomePositions(int docNum, int[] positions, Term curTerm): Prune some postings per term (invoked once per term per doc). |
abstract boolean | pruneTermEnum(TermEnum te): Pruning of all postings for a term (invoked once per term). |
abstract int | pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v): Pruning of individual terms in term vectors. |
boolean | pruneWholeTermVector(int docNumber, String field): Term vector pruning. |

protected IndexReader in

protected TermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
in - input reader
fieldFlags - a map, where keys are field names and values are bitwise-OR flags of operations to be performed (see PruningPolicy for more details).

public boolean pruneWholeTermVector(int docNumber, String field) throws IOException
docNumber - document number
field - field name
Controlled by the PruningPolicy.DEL_VECTOR flag.
Throws: IOException

public boolean pruneAllFieldPostings(String field) throws IOException
field - field name
Controlled by the PruningPolicy.DEL_POSTINGS flag.
Throws: IOException
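The fieldFlags map and the two field-level checks, pruneWholeTermVector and pruneAllFieldPostings, can be illustrated with a small sketch. It is written without Lucene on the classpath, so everything in it is an assumption: the class name `FieldFlagsSketch` is hypothetical, the numeric flag values are placeholders (the real constants are defined in PruningPolicy), and the checks are assumed to simply test the corresponding bit for the field.

```java
import java.util.HashMap;
import java.util.Map;

class FieldFlagsSketch {
    // Placeholder values for illustration only; the real constants live in PruningPolicy.
    static final int DEL_POSTINGS = 0x01;
    static final int DEL_STORED = 0x02;
    static final int DEL_VECTOR = 0x04;
    static final int DEL_PAYLOADS = 0x08;

    // Keys are field names, values are bitwise-OR combinations of the flags above.
    private final Map<String, Integer> fieldFlags;

    FieldFlagsSketch(Map<String, Integer> fieldFlags) {
        this.fieldFlags = new HashMap<>(fieldFlags);
    }

    /** True when the flag bit is set for the field; unlisted fields have no flags. */
    private boolean hasFlag(String field, int flag) {
        Integer flags = fieldFlags.get(field);
        return flags != null && (flags & flag) != 0;
    }

    /** Assumed behaviour: drop the whole term vector when DEL_VECTOR is set. */
    boolean pruneWholeTermVector(String field) {
        return hasFlag(field, DEL_VECTOR);
    }

    /** Assumed behaviour: drop all postings of the field when DEL_POSTINGS is set. */
    boolean pruneAllFieldPostings(String field) {
        return hasFlag(field, DEL_POSTINGS);
    }

    public static void main(String[] args) {
        Map<String, Integer> flags = new HashMap<>();
        flags.put("body", DEL_VECTOR | DEL_PAYLOADS); // two operations on one field
        flags.put("debug", DEL_POSTINGS);

        FieldFlagsSketch policy = new FieldFlagsSketch(flags);
        System.out.println(policy.pruneWholeTermVector("body"));
        System.out.println(policy.pruneAllFieldPostings("body"));
        System.out.println(policy.pruneAllFieldPostings("debug"));
    }
}
```

Combining flags with bitwise OR lets a single map entry request several operations for one field, and a field that is absent from the map is left untouched by these two checks.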

public abstract void initPositionsTerm(TermPositions in, Term t) throws IOException
Called when moving TermPositions to a new Term.
in - input term positions
t - current term
Throws: IOException

public boolean prunePayload(TermPositions in, Term curTerm)
in - positioned term positions
curTerm - current term associated with these positions

public abstract int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v) throws IOException
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
Throws: IOException

public abstract boolean pruneTermEnum(TermEnum te) throws IOException
te - positioned term enum.
Throws: IOException

public abstract boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Throws: IOException

public abstract int pruneSomePositions(int docNum, int[] positions, Term curTerm)
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term