Class KnnVectorDict

java.lang.Object
org.apache.lucene.demo.knn.KnnVectorDict
All Implemented Interfaces:
Closeable, AutoCloseable

public class KnnVectorDict extends Object implements Closeable
Manages a map from token to numeric vector for use with KnnVector indexing and search. The map is stored as an FST: token-to-ordinal plus a dense binary file holding the vectors.
  • Constructor Details

    • KnnVectorDict

      public KnnVectorDict(Directory directory, String dictName) throws IOException
      Sole constructor
      Parameters:
      directory - Lucene directory from which the knn dictionary should be read.
      dictName - the base name of the directory files that store the knn vector dictionary. A file with extension '.bin' holds the vectors, and a file with extension '.fst' maps tokens to offsets in the '.bin' file.
      Throws:
      IOException
  • Method Details

    • get

      public void get(BytesRef token, byte[] output) throws IOException
      Get the vector corresponding to the given token and write it into the output array. NOTE: the contents of output are overwritten by each call, so a caller that reuses the same array is responsible for copying the data as needed.
      Parameters:
      token - the token to look up
      output - the array in which to write the corresponding vector. Its length must be getDimension() * Float.BYTES. It will be filled with zeros if the token is not present in the dictionary.
      Throws:
      IllegalArgumentException - if the output array is incorrectly sized
      IOException - if there is a problem reading the dictionary
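The output array holds raw bytes, so a caller usually has to decode it into floats before use. The following is a minimal sketch of that step using only the JDK. VectorBytes, toFloats and isAllZero are hypothetical names, not part of Lucene, and the little-endian byte order is an assumption that should be checked against the Lucene version that wrote the dictionary.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of Lucene. */
final class VectorBytes {

  /**
   * Decodes a buffer of dimension * Float.BYTES bytes into a float vector.
   * ASSUMPTION: the floats are stored little-endian; the documentation above does not say.
   */
  static float[] toFloats(byte[] raw, int dimension) {
    if (raw.length != dimension * Float.BYTES) {
      throw new IllegalArgumentException(
          "expected " + dimension * Float.BYTES + " bytes, got " + raw.length);
    }
    float[] vector = new float[dimension];
    ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  /** Returns true when every byte is zero, the state get() leaves for an unknown token. */
  static boolean isAllZero(byte[] raw) {
    for (byte b : raw) {
      if (b != 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int dimension = 2;
    // Stand-in for the bytes that get(token, output) would have written.
    byte[] output = ByteBuffer.allocate(dimension * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putFloat(0.5f)
        .putFloat(-1.25f)
        .array();
    float[] vector = toFloats(output, dimension);
    System.out.println(vector[0] + " " + vector[1]);
    System.out.println(isAllZero(new byte[dimension * Float.BYTES]));
  }
}
```

An all-zero buffer only suggests that the token was missing: a token whose stored vector is genuinely all zeros would look the same, so this check is a heuristic rather than a definitive membership test.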
    • getDimension

      public int getDimension()
      Get the dimension of the vectors returned by this dictionary.
      Returns:
      the vector dimension
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException
    • build

      public static void build(Path gloveInput, Directory directory, String dictName) throws IOException
      Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
      Parameters:
      gloveInput - the path to the input dictionary. The dictionary is delimited by newlines, and each line is space-delimited. The first column has the token, and the remaining columns are the vector components, as text. The dictionary must be sorted by its leading tokens (considered as bytes).
      directory - a Lucene directory to write the dictionary to.
      dictName - Base name for the knn dictionary files.
      Throws:
      IOException
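Because build requires input that is already sorted by token bytes, it can help to validate a GloVe file before converting it. The sketch below parses lines in the format described above and checks their order using only the JDK. GloveInputCheck and its methods are hypothetical names, and both the UTF-8 encoding and the unsigned byte comparison are assumptions about what "considered as bytes" means here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check for GloVe-formatted input; not part of Lucene. */
final class GloveInputCheck {

  /** Returns the token of one line, i.e. the text before the first space. */
  static String token(String line) {
    int space = line.indexOf(' ');
    if (space <= 0) {
      throw new IllegalArgumentException("no token/vector separator: " + line);
    }
    return line.substring(0, space);
  }

  /** Parses the remaining space-delimited columns as the vector components. */
  static float[] vector(String line) {
    String[] columns = line.split(" ");
    float[] components = new float[columns.length - 1];
    for (int i = 1; i < columns.length; i++) {
      components[i - 1] = Float.parseFloat(columns[i]);
    }
    return components;
  }

  /**
   * Checks that tokens appear in non-decreasing order of their bytes.
   * ASSUMPTION: tokens are UTF-8 and bytes compare as unsigned values.
   */
  static boolean isSortedByTokenBytes(List<String> lines) {
    byte[] previous = null;
    for (String line : lines) {
      byte[] current = token(line).getBytes(StandardCharsets.UTF_8);
      if (previous != null && Arrays.compareUnsigned(previous, current) > 0) {
        return false;
      }
      previous = current;
    }
    return true;
  }

  public static void main(String[] args) {
    // By bytes, uppercase ASCII sorts before lowercase, so "Zebra" precedes "apple".
    List<String> lines = List.of("Zebra 0.1 0.2", "apple 0.3 0.4");
    System.out.println(isSortedByTokenBytes(lines));
    System.out.println(Arrays.toString(vector(lines.get(0))));
  }
}
```

The check tolerates repeated tokens because the documentation above does not say whether build accepts them; tighten the comparison to `>= 0` if duplicates should be rejected.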
    • ramBytesUsed

      public long ramBytesUsed()
      Return the size of the dictionary in bytes.