org.apache.lucene.codecs.lucene102.Lucene102BinaryQuantizedVectorsFormat

All Implemented Interfaces: NamedSPILoader.NamedSPI

public class Lucene102BinaryQuantizedVectorsFormat extends FlatVectorsFormat

The binary quantization format used here is a per-vector optimized scalar quantization. These ideas are evolutions of LVQ, proposed in "Similarity search in the blink of an eye with compressed indices" by Cecilia Aguerrebere et al., of the previous work on globally optimized scalar quantization in Apache Lucene, and of "Accelerating Large-Scale Inference with Anisotropic Vector Quantization" by Ruiqi Guo et al. See also OptimizedScalarQuantizer. Some of the key features are:

- Estimating the distance between two vectors from their centroid-centered representations. This requires some additional corrective factors, but it is what allows centroid centering to be used.
- Optimized scalar quantization of the centroid-centered vectors down to a single bit per dimension.
- Asymmetric quantization of vectors: query vectors are quantized to half-byte (4-bit) precision (normalized to the centroid) and then compared directly against the single-bit quantized vectors in the index.
- Transforming the half-byte quantized query vectors so that the comparison with single-bit vectors can be done with bit arithmetic.
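
The last two points can be illustrated with a minimal, self-contained sketch. It is not the Lucene implementation: the class name AsymmetricBitDot, its method names, and the MSB-first bit packing are assumptions made for illustration. The idea shown is that a 4-bit query can be transposed into four bit planes, so that its dot product with a 1-bit index vector reduces to AND and population-count operations, with plane p weighted by 2^p.

```java
/** Hypothetical sketch: dot product of a 4-bit query with a 1-bit index vector. */
final class AsymmetricBitDot {

  /** Packs one bit per dimension into bytes (MSB first; the bit order is an assumption). */
  static byte[] packIndexBits(int[] bits) {
    byte[] out = new byte[(bits.length + 7) / 8];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i] != 0) {
        out[i >> 3] |= (byte) (1 << (7 - (i & 7)));
      }
    }
    return out;
  }

  /** Transposes 4-bit query values into 4 bit planes laid out like the index vector. */
  static byte[][] transposeQuery(int[] q4) {
    byte[][] planes = new byte[4][(q4.length + 7) / 8];
    for (int i = 0; i < q4.length; i++) {
      for (int p = 0; p < 4; p++) {
        if (((q4[i] >> p) & 1) != 0) {
          planes[p][i >> 3] |= (byte) (1 << (7 - (i & 7)));
        }
      }
    }
    return planes;
  }

  /** Sums popcount(plane AND index) over the planes, weighting plane p by 2^p. */
  static long dot(byte[][] planes, byte[] index) {
    long sum = 0;
    for (int p = 0; p < 4; p++) {
      long count = 0;
      for (int j = 0; j < index.length; j++) {
        count += Integer.bitCount((planes[p][j] & index[j]) & 0xFF);
      }
      sum += count << p;
    }
    return sum;
  }

  /** Reference result computed one dimension at a time, for comparison. */
  static long naiveDot(int[] q4, int[] bits) {
    long sum = 0;
    for (int i = 0; i < q4.length; i++) {
      sum += (long) q4[i] * bits[i];
    }
    return sum;
  }
}
```

The integer result is only the raw quantized dot product. Turning it into a similarity estimate still needs the per-vector corrective factors and centroid terms described above, which this sketch leaves out.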

Earlier related work on improving over regular LVQ is "Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search" by Jianyang Gao et al.

The format is stored within two files:

.veb (vector data) file

Stores the binary quantized vectors in a flat format. Additionally, it stores each vector's corrective factors. At the end of the file, additional information is stored for vector ordinal to centroid ordinal mapping and sparse vector information.

For each vector:
- [byte] the binary quantized values; each byte holds 8 bits.
- [float] the optimized quantiles and an additional similarity-dependent corrective factor.
- short the sum of the quantized components.

After the vectors, sparse vector information keeping track of monotonic blocks.
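
As a rough illustration of the per-vector layout above, the sketch below estimates the bytes stored per vector. The class VebLayoutSketch and its methods are hypothetical. The number of corrective floats is passed in as a parameter because the text does not state it (three would correspond to two quantiles plus one corrective factor, which is an assumption), and any padding of the bit-packed values that the real format may apply is ignored.

```java
/** Hypothetical size estimate for one entry in the .veb file; not Lucene code. */
final class VebLayoutSketch {

  /** Bytes needed to hold one bit per dimension, with no padding assumed. */
  static int packedBytes(int dims) {
    return (dims + 7) / 8;
  }

  /** Packed bits, plus the corrective floats, plus one short for the component sum. */
  static int bytesPerVector(int dims, int numCorrectiveFloats) {
    return packedBytes(dims) + numCorrectiveFloats * Float.BYTES + Short.BYTES;
  }
}
```

Under these assumptions, a 1024-dimensional vector with three corrective floats takes 128 + 12 + 2 = 142 bytes, compared with 4096 bytes for the same vector stored as raw 32-bit floats.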

.vemb (vector metadata) file

Stores the metadata for the vectors. This includes the number of vectors, the number of dimensions, and file offset information.

int the field number
int the vector encoding ordinal
int the vector similarity ordinal
vint the vector dimensions
vlong the offset to the vector data in the .veb file
vlong the length of the vector data in the .veb file
vint the number of vectors
[float] the centroid
float the centroid square magnitude
The sparse vector information, if required, mapping vector ordinal to doc ID
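
Several of the metadata entries above are typed vint or vlong, which are Lucene's variable-length integer encodings: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. The following is a minimal sketch of that encoding for non-negative int values. VIntSketch and its methods are hypothetical helpers written for illustration, not Lucene's DataInput or DataOutput API.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of the variable-length int encoding used for vint fields. */
final class VIntSketch {

  /** Encodes a non-negative int, 7 bits per byte, low-order bits first. */
  static byte[] writeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
      value >>>= 7;
    }
    out.write(value); // last byte has the high bit clear
    return out.toByteArray();
  }

  /** Decodes a value written by writeVInt, starting at the given offset. */
  static int readVInt(byte[] buf, int offset) {
    int result = 0;
    int shift = 0;
    byte b;
    do {
      b = buf[offset++];
      result |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return result;
  }
}
```

With this encoding a dimension count such as 768 occupies two bytes rather than the four of a fixed-width int, and values below 128 occupy one.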

Field Summary

Fields

Modifier and Type, Field:
- static final String BINARIZED_VECTOR_COMPONENT
- static final byte INDEX_BITS
- static final String NAME
- static final byte QUERY_BITS

Fields inherited from class org.apache.lucene.codecs.KnnVectorsFormat
DEFAULT_MAX_DIMENSIONS, EMPTY
Constructor Summary

Constructors

- Lucene102BinaryQuantizedVectorsFormat(): Creates a new instance with the default number of vectors per cluster.
Method Summary

Modifier and Type, Method, Description:
- FlatVectorsReader fieldsReader(SegmentReadState state): Returns a KnnVectorsReader to read the vectors from the index.
- FlatVectorsWriter fieldsWriter(SegmentWriteState state): Returns a FlatVectorsWriter to write the vectors to the index.
- int getMaxDimensions(String fieldName): Returns the maximum number of vector dimensions supported by this codec for the given field name.
- String toString()

Methods inherited from class org.apache.lucene.codecs.KnnVectorsFormat
availableKnnVectorsFormats, forName, getName, reloadKnnVectorsFormat

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- QUERY_BITS
  
  public static final byte QUERY_BITS
  See Also:
  
  Constant Field Values
- INDEX_BITS
  
  public static final byte INDEX_BITS
  See Also:
  
  Constant Field Values
- BINARIZED_VECTOR_COMPONENT
  
  public static final String BINARIZED_VECTOR_COMPONENT
  See Also:
  
  Constant Field Values
- NAME
  
  public static final String NAME
  See Also:
  
  Constant Field Values
Constructor Details
- Lucene102BinaryQuantizedVectorsFormat
  
  public Lucene102BinaryQuantizedVectorsFormat()
  
  Creates a new instance with the default number of vectors per cluster.
Method Details
- fieldsWriter
  
  public FlatVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException
  
  Description copied from class: FlatVectorsFormat
  
  Returns a FlatVectorsWriter to write the vectors to the index.
  
  Specified by:
  
  fieldsWriter in class FlatVectorsFormat
  
  Throws:
  
  IOException
- fieldsReader
  
  public FlatVectorsReader fieldsReader(SegmentReadState state) throws IOException
  
  Description copied from class: FlatVectorsFormat
  
  Returns a KnnVectorsReader to read the vectors from the index.
  
  Specified by:
  
  fieldsReader in class FlatVectorsFormat
  
  Throws:
  
  IOException
- getMaxDimensions
  
  public int getMaxDimensions(String fieldName)
  
  Description copied from class: KnnVectorsFormat
  
  Returns the maximum number of vector dimensions supported by this codec for the given field name.
  Codecs implement this method to specify the maximum number of dimensions they support.
  
  Overrides:
  
  getMaxDimensions in class FlatVectorsFormat
  
  Parameters:
  
  fieldName - the field name
  
  Returns:
  
  the maximum number of vector dimensions.
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
