Class BKDWriter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class BKDWriter
    extends Object
    implements Closeable
    Recursively builds a block KD-tree to assign all incoming points in N-dim space to smaller and smaller N-dim rectangles (cells) until the number of points in a given rectangle is <= maxPointsInLeafNode. The tree is fully balanced, which means the leaf nodes will have between 50% and 100% of the requested maxPointsInLeafNode. Values that fall exactly on a cell boundary may be in either cell.
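    The recursive partitioning described above can be illustrated with a toy in-heap builder. This is a minimal sketch, not Lucene's implementation: it assumes int coordinates instead of packed byte[] values, always splits the widest dimension at the median, and the names ToyBlockKdTree, build and widestDim are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy in-heap block KD-tree builder; illustrative only, not BKDWriter itself. */
class ToyBlockKdTree {

  /** Recursively partitions points[from, to) and collects the resulting leaf blocks. */
  static void build(int[][] points, int from, int to, int maxPointsInLeafNode, List<int[][]> leaves) {
    int count = to - from;
    if (count <= maxPointsInLeafNode) {
      // Small enough: this range becomes one leaf block.
      leaves.add(Arrays.copyOfRange(points, from, to));
      return;
    }
    final int dim = widestDim(points, from, to);
    // Order the range along the chosen dimension, then cut it at the median.
    Arrays.sort(points, from, to, (int[] a, int[] b) -> Integer.compare(a[dim], b[dim]));
    int mid = from + count / 2;
    build(points, from, mid, maxPointsInLeafNode, leaves);
    build(points, mid, to, maxPointsInLeafNode, leaves);
  }

  /** Returns the dimension with the largest (max - min) span in points[from, to). */
  static int widestDim(int[][] points, int from, int to) {
    int best = 0;
    long bestSpan = -1;
    for (int d = 0; d < points[from].length; d++) {
      int min = Integer.MAX_VALUE;
      int max = Integer.MIN_VALUE;
      for (int i = from; i < to; i++) {
        min = Math.min(min, points[i][d]);
        max = Math.max(max, points[i][d]);
      }
      long span = (long) max - min;
      if (span > bestSpan) {
        bestSpan = span;
        best = d;
      }
    }
    return best;
  }
}
```

    With median splits, any leaf produced by splitting a range larger than maxPointsInLeafNode ends up at least half full, which is where the 50% to 100% bound comes from.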

    The number of dimensions can be 1 to 8 (MAX_DIMS), and the byte[] value for every dimension must have the same fixed length (bytesPerDim).

    See this paper for details.

    This consumes heap during writing: it allocates a LongBitSet(numPoints), and then uses up to the specified maxMBSortInHeap heap space for writing.
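    As a rough guide to the heap cost, the following sketch estimates the bytes involved. It assumes the bit set stores one bit per point in 64-bit words and ignores object overhead; the class and method names are hypothetical.

```java
/** Back-of-the-envelope heap estimate for writing; an approximation, not an exact figure. */
class BkdHeapEstimate {

  /** Bytes used by a bit set holding one bit per point, stored in 64-bit words. */
  static long bitSetBytes(long numPoints) {
    long words = (numPoints + 63) / 64; // round up to whole words
    return words * Long.BYTES;
  }

  /** Bit set plus the sort budget, where the budget is given in megabytes. */
  static long estimateBytes(long numPoints, double maxMBSortInHeap) {
    return bitSetBytes(numPoints) + (long) (maxMBSortInHeap * 1024 * 1024);
  }

  public static void main(String[] args) {
    System.out.println(estimateBytes(1_000_000_000L, 16.0) + " bytes");
  }
}
```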

    NOTE: This can write at most Integer.MAX_VALUE * maxPointsInLeafNode total points.
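    The limit is easy to get wrong in code because the product overflows int arithmetic. A small sketch of the calculation follows; the leaf size of 1024 is only an example value, and the class name is hypothetical.

```java
/** Computes the documented point limit without int overflow. */
class BkdCapacity {

  /** Integer.MAX_VALUE * maxPointsInLeafNode, evaluated in 64-bit arithmetic. */
  static long maxTotalPoints(int maxPointsInLeafNode) {
    return (long) Integer.MAX_VALUE * maxPointsInLeafNode; // widen before multiplying
  }

  public static void main(String[] args) {
    System.out.println(maxTotalPoints(1024));
  }
}
```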

    WARNING: This API is experimental and might change in incompatible ways in the next release.
    • Field Detail

      • VERSION_COMPRESSED_DOC_IDS

        public static final int VERSION_COMPRESSED_DOC_IDS
        See Also:
        Constant Field Values
      • VERSION_COMPRESSED_VALUES

        public static final int VERSION_COMPRESSED_VALUES
        See Also:
        Constant Field Values
      • VERSION_IMPLICIT_SPLIT_DIM_1D

        public static final int VERSION_IMPLICIT_SPLIT_DIM_1D
        See Also:
        Constant Field Values
      • VERSION_LEAF_STORES_BOUNDS

        public static final int VERSION_LEAF_STORES_BOUNDS
        See Also:
        Constant Field Values
      • VERSION_SELECTIVE_INDEXING

        public static final int VERSION_SELECTIVE_INDEXING
        See Also:
        Constant Field Values
      • DEFAULT_MAX_POINTS_IN_LEAF_NODE

        public static final int DEFAULT_MAX_POINTS_IN_LEAF_NODE
        Default maximum number of points in each leaf block
        See Also:
        Constant Field Values
      • DEFAULT_MAX_MB_SORT_IN_HEAP

        public static final float DEFAULT_MAX_MB_SORT_IN_HEAP
        Default maximum heap to use, in MB, before spilling to (slower) disk
        See Also:
        Constant Field Values
      • MAX_DIMS

        public static final int MAX_DIMS
        Maximum number of dimensions
        See Also:
        Constant Field Values
      • numDataDims

        protected final int numDataDims
        How many dimensions we are storing at the leaf (data) nodes
      • numIndexDims

        protected final int numIndexDims
        How many dimensions we are indexing in the internal nodes
      • bytesPerDim

        protected final int bytesPerDim
        How many bytes each value in each dimension takes.
      • packedBytesLength

        protected final int packedBytesLength
        numDataDims * bytesPerDim
      • packedIndexBytesLength

        protected final int packedIndexBytesLength
        numIndexDims * bytesPerDim
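        To make the packed layout concrete, here is a minimal sketch that packs int values into one byte[] of length numDataDims * bytesPerDim and compares a single dimension as unsigned bytes. The sign-bit flip and big-endian order are assumptions chosen so that byte order matches numeric order; PackedPointSketch and its methods are hypothetical and are not the helpers Lucene uses.

```java
import java.nio.ByteBuffer;

/** Packs fixed-width dimension values into a single byte[]; illustrative only. */
class PackedPointSketch {

  /** Packs each int as 4 big-endian bytes with the sign bit flipped. */
  static byte[] pack(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES); // big-endian by default
    for (int v : values) {
      buf.putInt(v ^ 0x80000000); // flip sign bit so negatives sort before positives
    }
    return buf.array();
  }

  /** Compares one dimension of two packed values, byte by byte, as unsigned. */
  static int compareDim(byte[] a, byte[] b, int dim, int bytesPerDim) {
    int start = dim * bytesPerDim;
    for (int i = start; i < start + bytesPerDim; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}
```

        Here packedBytesLength would be values.length * 4, and dimension d occupies bytes d * bytesPerDim up to (d + 1) * bytesPerDim.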
      • maxPointsInLeafNode

        protected final int maxPointsInLeafNode
      • minPackedValue

        protected final byte[] minPackedValue
        Minimum per-dim values, packed
      • maxPackedValue

        protected final byte[] maxPackedValue
        Maximum per-dim values, packed
      • pointCount

        protected long pointCount
      • longOrds

        protected final boolean longOrds
        true if we have so many values that we must write ords using long (8 bytes) instead of int (4 bytes)
      • singleValuePerDoc

        protected final boolean singleValuePerDoc
        True if every document has at most one value. We specialize this case by not bothering to store the ord since it's redundant with docID.
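        The effect of longOrds and singleValuePerDoc on per-point storage can be sketched as follows. The layout assumed here (packed value, a 4-byte docID, then an optional 4- or 8-byte ord) and the Integer.MAX_VALUE threshold are assumptions for illustration; PointEntrySize is a hypothetical name.

```java
/** Estimates the size of one buffered point entry under the stated assumptions. */
class PointEntrySize {

  /** Assumed rule: ords need 8 bytes once the point count no longer fits in an int. */
  static boolean needsLongOrds(long totalPointCount) {
    return totalPointCount > Integer.MAX_VALUE;
  }

  /** Packed value + docID, plus an ord unless each document has at most one value. */
  static int bytesPerEntry(int packedBytesLength, boolean singleValuePerDoc, boolean longOrds) {
    int bytes = packedBytesLength + Integer.BYTES; // value bytes + 4-byte docID
    if (!singleValuePerDoc) {
      bytes += longOrds ? Long.BYTES : Integer.BYTES; // ord is redundant otherwise
    }
    return bytes;
  }
}
```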
      • offlineSorterBufferMB

        protected final OfflineSorter.BufferSize offlineSorterBufferMB
        How much heap OfflineSorter is allowed to use
      • offlineSorterMaxTempFiles

        protected final int offlineSorterMaxTempFiles
        Maximum number of temporary files OfflineSorter may create before it performs an intermediate merge
    • Constructor Detail

      • BKDWriter

        public BKDWriter(int maxDoc,
                         Directory tempDir,
                         String tempFileNamePrefix,
                         int numDataDims,
                         int numIndexDims,
                         int bytesPerDim,
                         int maxPointsInLeafNode,
                         double maxMBSortInHeap,
                         long totalPointCount,
                         boolean singleValuePerDoc)
                  throws IOException
        Throws:
        IOException
      • BKDWriter

        protected BKDWriter(int maxDoc,
                            Directory tempDir,
                            String tempFileNamePrefix,
                            int numDataDims,
                            int numIndexDims,
                            int bytesPerDim,
                            int maxPointsInLeafNode,
                            double maxMBSortInHeap,
                            long totalPointCount,
                            boolean singleValuePerDoc,
                            boolean longOrds,
                            long offlineSorterBufferMB,
                            int offlineSorterMaxTempFiles)
                     throws IOException
        Throws:
        IOException
    • Method Detail

      • verifyParams

        public static void verifyParams(int numDataDims,
                                        int numIndexDims,
                                        int maxPointsInLeafNode,
                                        double maxMBSortInHeap,
                                        long totalPointCount)
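        This page does not describe which checks verifyParams performs. The sketch below shows the kind of validation the documented constraints suggest (1 to 8 dimensions, index dimensions not exceeding data dimensions, non-negative sizes). The exact conditions and the exception type are assumptions, and VerifyParamsSketch is a hypothetical stand-in.

```java
/** Plausible parameter validation; the real checks may differ. */
class VerifyParamsSketch {

  static final int MAX_DIMS = 8; // the class description allows 1 to 8 dimensions

  static void verifyParams(int numDataDims, int numIndexDims, int maxPointsInLeafNode,
                           double maxMBSortInHeap, long totalPointCount) {
    if (numDataDims < 1 || numDataDims > MAX_DIMS) {
      throw new IllegalArgumentException("numDataDims must be 1 .. " + MAX_DIMS + " (got: " + numDataDims + ")");
    }
    if (numIndexDims < 1 || numIndexDims > numDataDims) {
      throw new IllegalArgumentException("numIndexDims must be 1 .. " + numDataDims + " (got: " + numIndexDims + ")");
    }
    if (maxPointsInLeafNode <= 0) {
      throw new IllegalArgumentException("maxPointsInLeafNode must be > 0 (got: " + maxPointsInLeafNode + ")");
    }
    if (maxMBSortInHeap < 0.0) {
      throw new IllegalArgumentException("maxMBSortInHeap must be >= 0.0 (got: " + maxMBSortInHeap + ")");
    }
    if (totalPointCount < 0) {
      throw new IllegalArgumentException("totalPointCount must be >= 0 (got: " + totalPointCount + ")");
    }
  }
}
```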
      • getPointCount

        public long getPointCount()
        How many points have been added so far
      • split

        protected int split(byte[] minPackedValue,
                            byte[] maxPackedValue,
                            int[] parentSplits)
        Pick the next dimension to split.
        Parameters:
        minPackedValue - the min values for all dimensions
        maxPackedValue - the max values for all dimensions
        parentSplits - how many times each dim has been split on the parent levels
        Returns:
        the dimension to split
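        One plausible way to use these three inputs is sketched below: prefer a dimension that has been split far less often than the most-split one, provided its values actually differ, and otherwise pick the dimension with the widest span. This heuristic, the extra bytesPerDim parameter, and the name SplitDimSketch are assumptions for illustration, not a statement of what BKDWriter does.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Illustrative split-dimension heuristic over packed min/max values. */
class SplitDimSketch {

  /** Reads one dimension of a packed value as an unsigned big-endian number. */
  static BigInteger dimValue(byte[] packed, int dim, int bytesPerDim) {
    int start = dim * bytesPerDim;
    return new BigInteger(1, Arrays.copyOfRange(packed, start, start + bytesPerDim));
  }

  static int split(byte[] minPackedValue, byte[] maxPackedValue, int[] parentSplits, int bytesPerDim) {
    int numDims = parentSplits.length;
    int maxSplits = 0;
    for (int s : parentSplits) {
      maxSplits = Math.max(maxSplits, s);
    }
    // Span (max - min) of every dimension.
    BigInteger[] spans = new BigInteger[numDims];
    for (int dim = 0; dim < numDims; dim++) {
      spans[dim] = dimValue(maxPackedValue, dim, bytesPerDim)
          .subtract(dimValue(minPackedValue, dim, bytesPerDim));
    }
    // First choice: a dimension split less than half as often as the most-split one.
    for (int dim = 0; dim < numDims; dim++) {
      if (parentSplits[dim] < maxSplits / 2 && spans[dim].signum() > 0) {
        return dim;
      }
    }
    // Fallback: the dimension with the widest span.
    int best = 0;
    for (int dim = 1; dim < numDims; dim++) {
      if (spans[dim].compareTo(spans[best]) > 0) {
        best = dim;
      }
    }
    return best;
  }
}
```

        Taking parentSplits into account keeps a single wide dimension from absorbing every split, so the cells stay closer to square.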