Class KnnVectorDict

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class KnnVectorDict
    extends Object
    implements Closeable
    Manages a map from token to numeric vector for use with KnnVector indexing and search. The map is stored as two parts: an FST that maps each token to an ordinal, and a dense binary file holding the vectors.
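As a rough illustration of the layout described above, the sketch below models the dictionary with plain JDK types. A `TreeMap` stands in for the FST, and the offset arithmetic (ordinal times dimension times `Float.BYTES`) is an assumption about how a dense vector file could be addressed; this page does not state it. The class `DenseVectorLayout` and its methods are hypothetical and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model: a token -> ordinal map plus one dense block of vector bytes.
final class DenseVectorLayout {
  private final Map<String, Integer> ordinals = new TreeMap<>(); // stand-in for the FST
  private final byte[] vectors; // all vectors stored back to back, no padding
  private final int bytesPerVector;

  DenseVectorLayout(String[] sortedTokens, byte[] vectors, int dimension) {
    this.bytesPerVector = dimension * Float.BYTES;
    if (vectors.length != sortedTokens.length * bytesPerVector) {
      throw new IllegalArgumentException("vector block does not match token count");
    }
    for (int i = 0; i < sortedTokens.length; i++) {
      ordinals.put(sortedTokens[i], i); // ordinal = position in sorted order
    }
    this.vectors = vectors;
  }

  // Byte offset of the vector with the given ordinal (assumed addressing scheme).
  long offsetOf(int ordinal) {
    return (long) ordinal * bytesPerVector;
  }

  // Copies the token's vector bytes into output, or zero-fills output for an unknown token.
  void get(String token, byte[] output) {
    if (output.length != bytesPerVector) {
      throw new IllegalArgumentException("output must hold " + bytesPerVector + " bytes");
    }
    Integer ordinal = ordinals.get(token);
    if (ordinal == null) {
      Arrays.fill(output, (byte) 0);
      return;
    }
    System.arraycopy(vectors, (int) offsetOf(ordinal), output, 0, bytesPerVector);
  }
}
```

Because every vector has the same size, the ordinal alone is enough to locate a vector, which is why a dense file needs no per-entry index.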
    • Constructor Detail

      • KnnVectorDict

        public KnnVectorDict​(Directory directory,
                             String dictName)
                      throws IOException
        Sole constructor
        directory - Lucene directory from which the knn dictionary should be read.
        dictName - the base name of the files in the directory that store the knn vector dictionary. A file with extension '.bin' holds the vectors, and a file with extension '.fst' maps tokens to offsets in the '.bin' file.
    • Method Detail

      • get

        public void get​(BytesRef token,
                        byte[] output)
                 throws IOException
        Get the vector corresponding to the given token, writing it into the caller-supplied output array. NOTE: the contents of output are overwritten by each call that reuses the same array. The caller is responsible for copying the data if it must be retained.
        token - the token to look up
        output - the array in which to write the corresponding vector. Its length must be getDimension() * Float.BYTES. It will be filled with zeros if the token is not present in the dictionary.
        IllegalArgumentException - if the output array is incorrectly sized
        IOException - if there is a problem reading the dictionary
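The sizing rule for `output` can be made concrete with a small JDK-only helper that turns the filled byte array back into floats. This is a minimal sketch, not part of the API: the class `VectorBytes` is hypothetical, and the little-endian byte order is an assumption that this page does not confirm, so check it against the Lucene version in use.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for working with the byte[] filled by get(token, output).
final class VectorBytes {
  // Decodes exactly dimension * Float.BYTES bytes into a float vector.
  // ASSUMPTION: components are stored little-endian; the text above does not say.
  static float[] toFloats(byte[] bytes, int dimension) {
    if (bytes.length != dimension * Float.BYTES) {
      throw new IllegalArgumentException(
          "expected " + dimension * Float.BYTES + " bytes, got " + bytes.length);
    }
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  // True if every byte is zero, which is what a lookup of an absent token produces.
  static boolean isAllZero(byte[] bytes) {
    for (byte b : bytes) {
      if (b != 0) {
        return false;
      }
    }
    return true;
  }
}
```

A caller would typically allocate `new byte[dict.getDimension() * Float.BYTES]` once, pass it to `get`, and decode it before the next lookup. An all-zero result signals a missing token only if no stored vector is itself all zeros.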
      • getDimension

        public int getDimension()
        Get the dimension of the vectors returned by this dictionary.
        the vector dimension
      • build

        public static void build​(Path gloveInput,
                                 Directory directory,
                                 String dictName)
                          throws IOException
        Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
        gloveInput - the path to the input dictionary. The dictionary is delimited by newlines, and each line is space-delimited. The first column has the token, and the remaining columns are the vector components, as text. The dictionary must be sorted by its leading tokens (considered as bytes).
        directory - a Lucene directory to write the dictionary to.
        dictName - Base name for the knn dictionary files.
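The input format described for `gloveInput` can be checked before calling `build`. The sketch below parses one line into a token and its components and verifies the ordering requirement. The class `GloveLines` is hypothetical, and three details are assumptions rather than statements from this page: tokens are encoded as UTF-8, bytes are compared as unsigned values, and duplicate tokens count as unsorted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight checks for a GloVe-formatted dictionary file.
final class GloveLines {
  // Returns the first space-delimited column, which holds the token.
  static String token(String line) {
    int space = line.indexOf(' ');
    if (space <= 0) {
      throw new IllegalArgumentException("line has no token column: " + line);
    }
    return line.substring(0, space);
  }

  // Parses the remaining columns as the vector components.
  static float[] vector(String line) {
    String[] columns = line.split(" ");
    if (columns.length < 2) {
      throw new IllegalArgumentException("line has no vector components: " + line);
    }
    float[] components = new float[columns.length - 1];
    for (int i = 1; i < columns.length; i++) {
      components[i - 1] = Float.parseFloat(columns[i]);
    }
    return components;
  }

  // True if tokens strictly increase when compared as unsigned UTF-8 bytes (assumed ordering).
  static boolean isSortedByTokenBytes(List<String> lines) {
    byte[] previous = null;
    for (String line : lines) {
      byte[] current = token(line).getBytes(StandardCharsets.UTF_8);
      if (previous != null && Arrays.compareUnsigned(previous, current) >= 0) {
        return false;
      }
      previous = current;
    }
    return true;
  }
}
```

Byte order differs from locale-aware or case-insensitive order: under this comparison every uppercase ASCII token sorts before every lowercase one, so a file sorted with a locale-sensitive tool may fail the check.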
      • ramBytesUsed

        public long ramBytesUsed()
        Return the size of the dictionary in bytes.