Package org.apache.lucene.misc.index
Class BPIndexReorderer
java.lang.Object
org.apache.lucene.misc.index.BPIndexReorderer
Implementation of "recursive graph bisection", also called "bipartite graph partitioning" and
often abbreviated BP, an approach to doc ID assignment that aims at reducing the sum of the log
gap between consecutive postings. While originally targeted at reducing the size of postings,
this algorithm has been observed to also speed up queries significantly by clustering documents
that have similar sets of terms together.
This algorithm was initially described by Dhulipala et al. in "Compressing graphs and inverted indexes with recursive graph bisection". This implementation takes advantage of some optimizations suggested by Mackenzie et al. in "Tradeoff Options for Bipartite Graph Partitioning".
Typical usage would look like this:
LeafReader reader; // reader to reorder Directory targetDir; // Directory where to write the reordered index Directory targetDir = FSDirectory.open(targetPath); BPIndexReorderer reorderer = new BPIndexReorderer(); ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors(), p -> new ForkJoinWorkerThread(p) {}, null, false); reorderer.setForkJoinPool(pool); reorderer.setFields(Collections.singleton("body")); CodecReader reorderedReaderView = reorderer.reorder(SlowCodecReaderWrapper.wrap(reader), targetDir); try (IndexWriter w = new IndexWriter(targetDir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE))) { w.addIndexes(reorderedReaderView); } DirectoryReader reorderedReader = DirectoryReader.open(targetDir);
Note: This is a slow operation that consumes O(maxDoc + numTerms * numThreads) memory.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic class
Exception that is thrown when not enough RAM is available. -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final int
Default maximum number of iterations per recursion level: 20.static final int
Minimum required document frequency for terms to be considered: 4,096.static final int
Minimum size of partitions. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptioncomputeDocMap
(CodecReader reader, Directory tempDir, Executor executor) Expert: Compute theSorter.DocMap
that holds the new doc ID numbering.reorder
(CodecReader reader, Directory tempDir, Executor executor) Reorder the givenCodecReader
into a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency.void
Sets the fields to use to perform partitioning.void
setMaxDocFreq
(float maxDocFreq) Set the maximum document frequency for terms to be considered, as a ratio ofmaxDoc
.void
setMaxIters
(int maxIters) Set the maximum number of iterations on each recursion level, 20 by default.void
setMinDocFreq
(int minDocFreq) Set the minimum document frequency for terms to be considered, 4096 by default.void
setMinPartitionSize
(int minPartitionSize) Set the minimum partition size, when the algorithm stops recursing, 32 by default.void
setRAMBudgetMB
(double ramBudgetMB) Set the amount of RAM that graph partitioning is allowed to use.
-
Field Details
-
DEFAULT_MIN_DOC_FREQ
public static final int DEFAULT_MIN_DOC_FREQMinimum required document frequency for terms to be considered: 4,096.- See Also:
-
DEFAULT_MIN_PARTITION_SIZE
public static final int DEFAULT_MIN_PARTITION_SIZEMinimum size of partitions. The algorithm will stop recursing when reaching partitions below this number of documents: 32.- See Also:
-
DEFAULT_MAX_ITERS
public static final int DEFAULT_MAX_ITERSDefault maximum number of iterations per recursion level: 20. Higher numbers of iterations typically don't help significantly.- See Also:
-
-
Constructor Details
-
BPIndexReorderer
public BPIndexReorderer()Constructor.
-
-
Method Details
-
setMinDocFreq
public void setMinDocFreq(int minDocFreq) Set the minimum document frequency for terms to be considered, 4096 by default. -
setMaxDocFreq
public void setMaxDocFreq(float maxDocFreq) Set the maximum document frequency for terms to be considered, as a ratio ofmaxDoc
. This is useful because very frequent terms (stop words) add significant overhead to the reordering logic while not being very relevant for ordering. This value must be in (0, 1]. Default value is 1. -
setMinPartitionSize
public void setMinPartitionSize(int minPartitionSize) Set the minimum partition size, when the algorithm stops recursing, 32 by default. -
setMaxIters
public void setMaxIters(int maxIters) Set the maximum number of iterations on each recursion level, 20 by default. Experiments suggests that values above 20 do not help much. However, values below 20 can be used to trade effectiveness for faster reordering. -
setRAMBudgetMB
public void setRAMBudgetMB(double ramBudgetMB) Set the amount of RAM that graph partitioning is allowed to use. More RAM allows running faster. If not enough RAM is provided, aBPIndexReorderer.NotEnoughRAMException
will be thrown. This is 10% of the total heap size by default. -
setFields
Sets the fields to use to perform partitioning. Anull
value indicates that all indexed fields should be used. -
computeDocMap
public Sorter.DocMap computeDocMap(CodecReader reader, Directory tempDir, Executor executor) throws IOException Expert: Compute theSorter.DocMap
that holds the new doc ID numbering. This is exposed to enable integration intoBPReorderingMergePolicy
,reorder(CodecReader, Directory, Executor)
should be preferred in general.- Throws:
IOException
-
reorder
public CodecReader reorder(CodecReader reader, Directory tempDir, Executor executor) throws IOException Reorder the givenCodecReader
into a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency. Note that the returnedCodecReader
is slow and should typically be used in a call toIndexWriter.addIndexes(CodecReader...)
.The provided
Executor
can be used to perform reordering concurrently. A value ofnull
indicates that reordering should be performed in the current thread.NOTE: The provided
Executor
must not reject tasks.- Throws:
BPIndexReorderer.NotEnoughRAMException
- if not enough RAM is providedIOException
-