org.apache.lucene.misc.index.BPIndexReorderer

public final class BPIndexReorderer extends Object

Implementation of "recursive graph bisection", also called "bipartite graph partitioning" and often abbreviated BP, an approach to doc ID assignment that aims at reducing the sum of the log gap between consecutive postings. While originally targeted at reducing the size of postings, this algorithm has been observed to also speed up queries significantly by clustering documents that have similar sets of terms together.

This algorithm was initially described by Dhulipala et al. in "Compressing graphs and inverted indexes with recursive graph bisection". This implementation takes advantage of some optimizations suggested by Mackenzie et al. in "Tradeoff Options for Bipartite Graph Partitioning".

Typical usage would look like this:

 LeafReader reader; // reader to reorder
 Directory targetDir; // Directory where to write the reordered index

 Directory targetDir = FSDirectory.open(targetPath);
 BPIndexReorderer reorderer = new BPIndexReorderer();
 ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors(),
     p -> new ForkJoinWorkerThread(p) {}, null, false);
 reorderer.setForkJoinPool(pool);
 reorderer.setFields(Collections.singleton("body"));
 CodecReader reorderedReaderView = reorderer.reorder(SlowCodecReaderWrapper.wrap(reader), targetDir);
 try (IndexWriter w = new IndexWriter(targetDir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE))) {
   w.addIndexes(reorderedReaderView);
 }
 DirectoryReader reorderedReader = DirectoryReader.open(targetDir);

Note: This is a slow operation that consumes O(maxDoc + numTerms * numThreads) memory.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

BPIndexReorderer.NotEnoughRAMException

Exception that is thrown when not enough RAM is available.
Field Summary

Fields

Modifier and Type

Field

Description

static final int

DEFAULT_MAX_ITERS

Default maximum number of iterations per recursion level: 20.

static final int

DEFAULT_MIN_DOC_FREQ

Minimum required document frequency for terms to be considered: 4,096.

static final int

DEFAULT_MIN_PARTITION_SIZE

Minimum size of partitions.
Constructor Summary

Constructors

Constructor

Description

BPIndexReorderer()

Constructor.
Method Summary

Modifier and Type

Method

Description

Sorter.DocMap

computeDocMap(CodecReader reader, Directory tempDir)

Expert: Compute the Sorter.DocMap that holds the new doc ID numbering.

CodecReader

reorder(CodecReader reader, Directory tempDir)

Reorder the given CodecReader into a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency.

void

setFields(Set<String> fields)

Sets the fields to use to perform partitioning.

void

setForkJoinPool(ForkJoinPool forkJoinPool)

Set the ForkJoinPool to run graph partitioning concurrently.

void

setMaxDocFreq(float maxDocFreq)

Set the maximum document frequency for terms to be considered, as a ratio of maxDoc.

void

setMaxIters(int maxIters)

Set the maximum number of iterations on each recursion level, 20 by default.

void

setMinDocFreq(int minDocFreq)

Set the minimum document frequency for terms to be considered, 4096 by default.

void

setMinPartitionSize(int minPartitionSize)

Set the minimum partition size, when the algorithm stops recursing, 32 by default.

void

setRAMBudgetMB(double ramBudgetMB)

Set the amount of RAM that graph partitioning is allowed to use.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_MIN_DOC_FREQ
  
  public static final int DEFAULT_MIN_DOC_FREQ
  
  Minimum required document frequency for terms to be considered: 4,096.
  See Also:
  
  Constant Field Values
- DEFAULT_MIN_PARTITION_SIZE
  
  public static final int DEFAULT_MIN_PARTITION_SIZE
  
  Minimum size of partitions. The algorithm will stop recursing when reaching partitions below this number of documents: 32.
  See Also:
  
  Constant Field Values
- DEFAULT_MAX_ITERS
  
  public static final int DEFAULT_MAX_ITERS
  
  Default maximum number of iterations per recursion level: 20. Higher numbers of iterations typically don't help significantly.
  See Also:
  
  Constant Field Values
Constructor Details
- BPIndexReorderer
  
  public BPIndexReorderer()
  
  Constructor.
Method Details
- setMinDocFreq
  
  public void setMinDocFreq(int minDocFreq)
  
  Set the minimum document frequency for terms to be considered, 4096 by default.
- setMaxDocFreq
  
  public void setMaxDocFreq(float maxDocFreq)
  
  Set the maximum document frequency for terms to be considered, as a ratio of maxDoc. This is useful because very frequent terms (stop words) add significant overhead to the reordering logic while not being very relevant for ordering. This value must be in (0, 1]. Default value is 1.
- setMinPartitionSize
  
  public void setMinPartitionSize(int minPartitionSize)
  
  Set the minimum partition size, when the algorithm stops recursing, 32 by default.
- setMaxIters
  
  public void setMaxIters(int maxIters)
  
  Set the maximum number of iterations on each recursion level, 20 by default. Experiments suggests that values above 20 do not help much. However, values below 20 can be used to trade effectiveness for faster reordering.
- setForkJoinPool
  
  public void setForkJoinPool(ForkJoinPool forkJoinPool)
  
  Set the ForkJoinPool to run graph partitioning concurrently.
  NOTE: A value of null can be used to run in the current thread, which is the default.
- setRAMBudgetMB
  
  public void setRAMBudgetMB(double ramBudgetMB)
  
  Set the amount of RAM that graph partitioning is allowed to use. More RAM allows running faster. If not enough RAM is provided, a BPIndexReorderer.NotEnoughRAMException will be thrown. This is 10% of the total heap size by default.
- setFields
  
  public void setFields(Set<String> fields)
  
  Sets the fields to use to perform partitioning. A null value indicates that all indexed fields should be used.
- computeDocMap
  
  public Sorter.DocMap computeDocMap(CodecReader reader, Directory tempDir) throws IOException
  
  Expert: Compute the Sorter.DocMap that holds the new doc ID numbering. This is exposed to enable integration into BPReorderingMergePolicy, reorder(CodecReader, Directory) should be preferred in general.
  
  Throws:
  
  IOException
- reorder
  
  public CodecReader reorder(CodecReader reader, Directory tempDir) throws IOException
  
  Reorder the given CodecReader into a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency. Note that the returned CodecReader is slow and should typically be used in a call to IndexWriter.addIndexes(CodecReader...).
  
  Throws:
  
  BPIndexReorderer.NotEnoughRAMException - if not enough RAM is provided
  
  IOException

Class BPIndexReorderer

Nested Class Summary

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Details

DEFAULT_MIN_DOC_FREQ

DEFAULT_MIN_PARTITION_SIZE

DEFAULT_MAX_ITERS

Constructor Details

BPIndexReorderer

Method Details

setMinDocFreq

setMaxDocFreq

setMinPartitionSize

setMaxIters

setForkJoinPool

setRAMBudgetMB

setFields

computeDocMap

reorder