Package org.apache.lucene.demo.knn
Class KnnVectorDict
java.lang.Object
org.apache.lucene.demo.knn.KnnVectorDict
All Implemented Interfaces:
Closeable, AutoCloseable
Manages a map from token to numeric vector for use with KnnVector indexing and search. The map is
stored as two parts: an FST that maps each token to an ordinal, and a dense binary file that holds the vectors.
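The two-part layout described above can be illustrated with a toy in-memory model. This is only a sketch and not the class's actual implementation: a sorted String array with binary search stands in for the FST, a byte array stands in for the binary file, and the class name ToyVectorDict, the fixed-width record layout, and the little-endian float encoding are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Toy in-memory model: a sorted token array stands in for the FST (hypothetical class). */
final class ToyVectorDict {
  private final String[] sortedTokens; // position in this array plays the role of the ordinal
  private final byte[] vectors;        // dense storage: one fixed-width record per ordinal
  private final int dimension;

  /** sortedTokens must be non-empty and sorted in String natural order; values[i] belongs to sortedTokens[i]. */
  ToyVectorDict(String[] sortedTokens, float[][] values) {
    this.sortedTokens = sortedTokens.clone();
    this.dimension = values[0].length;
    ByteBuffer buffer =
        ByteBuffer.allocate(values.length * dimension * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN); // byte order is an assumption of this sketch
    for (float[] row : values) {
      for (float component : row) {
        buffer.putFloat(component);
      }
    }
    this.vectors = buffer.array();
  }

  /** Copies the token's vector bytes into output, or zero-fills output if the token is unknown. */
  void get(String token, byte[] output) {
    int recordBytes = dimension * Float.BYTES;
    if (output.length != recordBytes) {
      throw new IllegalArgumentException("output must hold " + recordBytes + " bytes");
    }
    int ordinal = Arrays.binarySearch(sortedTokens, token);
    if (ordinal < 0) {
      Arrays.fill(output, (byte) 0);
    } else {
      // The ordinal locates the record: offset = ordinal * dimension * Float.BYTES.
      System.arraycopy(vectors, ordinal * recordBytes, output, 0, recordBytes);
    }
  }
}
```

Because every record has the same width, the token-to-ordinal structure only needs to produce a small integer, and the vector is then found by multiplication rather than by a second lookup.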
Constructor Summary
Constructor                          Description
KnnVectorDict(directory, dictName)   Sole constructor.
Method Summary
Modifier and Type   Method                                   Description
static void         build(gloveInput, directory, dictName)   Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
void                close()
void                get(token, output)                       Get the vector corresponding to the given token.
int                 getDimension()                           Get the dimension of the vectors returned by this dictionary.
long                ramBytesUsed()                           Return the size of the dictionary in bytes.
Constructor Details
KnnVectorDict
Sole constructor.
Parameters:
directory - the Lucene directory from which the knn dictionary should be read.
dictName - the base name of the directory files that store the knn vector dictionary. A file with extension '.bin' holds the vectors, and a file with extension '.fst' maps tokens to offsets in the '.bin' file.
Throws:
IOException
Method Details
get
Get the vector corresponding to the given token. NOTE: the vector is written into the caller-supplied output array, so its contents will be overwritten by subsequent calls that reuse the same array. The caller is responsible for copying the data as needed.
Parameters:
token - the token to look up
output - the array in which to write the corresponding vector. Its length must be getDimension() * Float.BYTES. It will be filled with zeros if the token is not present in the dictionary.
Throws:
IllegalArgumentException - if the output array is incorrectly sized
IOException - if there is a problem reading the dictionary
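Since get writes raw bytes rather than floats, a caller typically needs to size the buffer, decode it, and recognise the all-zero result for a missing token. The helper below is a minimal sketch using only the JDK; the class name VectorBytes is hypothetical, and the little-endian byte order is an assumption that this page does not state.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helpers for a byte[] buffer filled by a get(token, output)-style lookup. */
final class VectorBytes {

  /** Decodes dimension floats into a fresh array; little-endian order is an assumption. */
  static float[] decode(byte[] bytes, int dimension) {
    int expected = dimension * Float.BYTES;
    if (bytes.length != expected) {
      throw new IllegalArgumentException("expected " + expected + " bytes, got " + bytes.length);
    }
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
    return vector;
  }

  /** True when every byte is zero, which is what a lookup of a missing token leaves behind. */
  static boolean isAllZero(byte[] bytes) {
    for (byte b : bytes) {
      if (b != 0) {
        return false;
      }
    }
    return true;
  }
}
```

Because decode returns a new float array, its result stays valid after the byte buffer is reused for the next lookup. Note that isAllZero cannot tell a missing token apart from a token whose stored vector is genuinely all zeros.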
getDimension
Get the dimension of the vectors returned by this dictionary.
Returns:
the vector dimension
close
Specified by:
close in interface AutoCloseable
Specified by:
close in interface Closeable
Throws:
IOException
build
Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
Parameters:
gloveInput - the path to the input dictionary. The dictionary is delimited by newlines, and each line is space-delimited. The first column has the token, and the remaining columns are the vector components, as text. The dictionary must be sorted by its leading tokens (considered as bytes).
directory - a Lucene directory to write the dictionary to.
dictName - the base name for the knn dictionary files.
Throws:
IOException
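The input format described for gloveInput can be checked before calling build. The following is a rough sketch using only the JDK: the class name GloveLines is hypothetical, and two points are assumptions rather than documented behavior, namely that tokens are compared as unsigned UTF-8 bytes and that duplicate tokens are rejected.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight checks for a GloVe-formatted text dictionary. */
final class GloveLines {

  /** Returns the token, which is everything before the first space. */
  static String token(String line) {
    int space = line.indexOf(' ');
    if (space <= 0) {
      throw new IllegalArgumentException("no token/vector separator: " + line);
    }
    return line.substring(0, space);
  }

  /** Parses the vector components that follow the token. */
  static float[] vector(String line) {
    String[] columns = line.split(" ");
    float[] values = new float[columns.length - 1];
    for (int i = 1; i < columns.length; i++) {
      values[i - 1] = Float.parseFloat(columns[i]);
    }
    return values;
  }

  /** Checks that tokens strictly increase when compared as unsigned UTF-8 bytes (assumed ordering). */
  static boolean isSortedByBytes(List<String> lines) {
    byte[] previous = null;
    for (String line : lines) {
      byte[] current = token(line).getBytes(StandardCharsets.UTF_8);
      if (previous != null && Arrays.compareUnsigned(previous, current) >= 0) {
        return false;
      }
      previous = current;
    }
    return true;
  }
}
```

Byte order differs from Java's default String ordering for some non-ASCII tokens, so a file sorted with String.compareTo may need to be re-sorted before it satisfies the "considered as bytes" requirement.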
ramBytesUsed
Return the size of the dictionary in bytes.