Lucene 4.1 stored fields format.
Principle
This StoredFieldsFormat compresses blocks of 16KB of documents in order to improve the compression ratio compared to document-level compression. It uses the LZ4 compression algorithm, which is fast at compressing and very fast at decompressing data. Although the compression method that is used focuses more on speed than on compression ratio, it should provide interesting compression ratios for redundant inputs (such as log files, HTML or plain text).
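To make the principle concrete, the following is a minimal, hypothetical Java sketch of the buffering scheme described above; it is not the actual Lucene implementation. The class and method names are made up, and java.util.zip.Deflater is used only as a stand-in for LZ4, which is not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

/**
 * Sketch of block-level compression: documents are appended to a buffer and the
 * whole buffer is compressed once it reaches ~16KB, so the compressor can exploit
 * redundancy shared between documents instead of compressing each one alone.
 */
public class BlockCompressionSketch {

  private static final int CHUNK_SIZE = 16 * 1024; // the 16KB trigger described above

  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final List<byte[]> compressedChunks = new ArrayList<>();

  /** Appends one serialized document and flushes a chunk when the buffer is large enough. */
  public void addDocument(byte[] serializedDoc) throws IOException {
    buffer.write(serializedDoc);
    if (buffer.size() >= CHUNK_SIZE) {
      flushChunk();
    }
  }

  /** Compresses all buffered documents as one chunk (Deflater stands in for LZ4). */
  private void flushChunk() {
    byte[] raw = buffer.toByteArray();
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] scratch = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(scratch);
      out.write(scratch, 0, n);
    }
    deflater.end();
    // The real format first writes per-document metadata (field counts, lengths)
    // so that readers can stop decompressing early; see the .fdt description below.
    compressedChunks.add(out.toByteArray());
    buffer.reset();
  }
}
```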
File formats
Stored fields are represented by two files:
- A fields data file (extension .fdt). This file stores a compact representation of documents in compressed blocks of 16KB or more. When writing a segment, documents are appended to an in-memory byte[] buffer. When its size reaches 16KB or more, some metadata about the documents is flushed to disk, immediately followed by a compressed representation of the buffer using the LZ4 compression format.
Here is a more detailed description of the field data file format:
- FieldData (.fdt) --> <Header>, PackedIntsVersion, CompressionFormat, <Chunk>^ChunkCount
- Header --> CodecHeader
- PackedIntsVersion --> PackedInts.VERSION_CURRENT as a VInt
- ChunkCount is not known in advance and is the number of chunks necessary to store all documents of the segment
- Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs>
- DocBase --> the ID of the first document of the chunk as a VInt
- ChunkDocs --> the number of documents in the chunk as a VInt
- DocFieldCounts --> the number of stored fields of every document in the chunk, encoded as follows (see the encoding sketch after this list):
- if chunkDocs=1, the unique value is encoded as a VInt
- else read a VInt (let's call it bitsRequired)
- if bitsRequired is 0 then all values are equal, and the common value is the following VInt
- else bitsRequired is the number of bits required to store any value, and values are stored in a packed array where every value is stored on exactly bitsRequired bits
- DocLengths --> the lengths of all documents in the chunk, encoded with the same method as DocFieldCounts
- CompressedDocs --> a compressed representation of <Docs> using the LZ4 compression format
- Docs --> <Doc>^ChunkDocs
- Doc --> <FieldNumAndType, Value>^DocFieldCount
- FieldNumAndType --> a VLong, whose last 3 bits are Type and whose other bits are FieldNum
- Type -->
- 0: Value is String
- 1: Value is BinaryValue
- 2: Value is Int
- 3: Value is Float
- 4: Value is Long
- 5: Value is Double
- 6, 7: unused
- FieldNum --> an ID of the field
- Value --> String | BinaryValue | Int | Float | Long | Double, depending on Type
- BinaryValue --> ValueLength <Byte>^ValueLength
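The conditional encoding used above for DocFieldCounts and DocLengths is easier to follow in code. Here is a minimal, hypothetical Java sketch, not Lucene's actual code: writeVInt and writePacked below are simplified stand-ins for Lucene's DataOutput and PackedInts utilities.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of the single/constant/packed encoding used for DocFieldCounts and DocLengths. */
public class ChunkIntsEncoder {

  /** Encodes one non-negative int per document in the chunk. */
  public static void writeInts(int[] values, ByteArrayOutputStream out) {
    if (values.length == 1) {
      writeVInt(out, values[0]);                 // chunkDocs == 1: the unique value as a VInt
      return;
    }
    boolean allEqual = true;
    int max = 0;
    for (int v : values) {
      if (v != values[0]) allEqual = false;
      max = Math.max(max, v);
    }
    if (allEqual) {
      writeVInt(out, 0);                         // bitsRequired == 0 means "all values are equal"
      writeVInt(out, values[0]);                 // followed by the common value
    } else {
      int bitsRequired = 32 - Integer.numberOfLeadingZeros(max); // bits needed for the largest value
      writeVInt(out, bitsRequired);
      writePacked(out, values, bitsRequired);    // every value on exactly bitsRequired bits
    }
  }

  /** Classic VInt: 7 data bits per byte, high bit set on all bytes but the last. */
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  /** Naive bit packer: concatenates bitsPerValue-bit values, padding the last byte with zeros. */
  static void writePacked(ByteArrayOutputStream out, int[] values, int bitsPerValue) {
    long acc = 0;
    int accBits = 0;
    for (int v : values) {
      acc = (acc << bitsPerValue) | (v & ((1L << bitsPerValue) - 1));
      accBits += bitsPerValue;
      while (accBits >= 8) {
        out.write((int) (acc >>> (accBits - 8)) & 0xFF);
        accBits -= 8;
      }
    }
    if (accBits > 0) {
      out.write((int) (acc << (8 - accBits)) & 0xFF);
    }
  }
}
```

FieldNumAndType is simpler: it is a single VLong equal to (fieldNum << 3) | type, so a reader recovers the value type from the last 3 bits and the field number from the remaining bits.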
Notes
- If documents are larger than 16KB then chunks will likely contain only one document. However, a document never spans several chunks (all fields of a single document are in the same chunk).
- Given that the original lengths are written in the metadata of the chunk, the decompressor can leverage this information to stop decoding as soon as enough data has been decompressed (see the sketch after these notes).
- In case documents are incompressible, CompressedDocs will be less than
0.5% larger than Docs.
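As an illustration of the early-stop note above, here is a hypothetical sketch; java.util.zip.Inflater stands in for the LZ4 decompressor and the names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/**
 * Because the uncompressed length of every document is stored in the chunk metadata,
 * a reader only needs to decompress the chunk up to the end of the requested document.
 */
public class PartialDecompressionSketch {

  /** Decompresses only the first {@code neededBytes} bytes of a compressed chunk. */
  static byte[] decompressPrefix(byte[] compressed, int neededBytes) throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream(neededBytes);
    byte[] scratch = new byte[4096];
    while (out.size() < neededBytes && !inflater.finished()) {
      int n = inflater.inflate(scratch);
      if (n == 0) break;                                        // truncated input: stop
      out.write(scratch, 0, Math.min(n, neededBytes - out.size()));
    }
    inflater.end();
    return out.toByteArray();
  }

  // A reader that wants document i would compute neededBytes as the sum of the lengths
  // of documents 0..i (available from the DocLengths metadata) and then skip the first
  // sum(lengths of documents 0..i-1) bytes of the returned prefix.
}
```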
- A fields index file (extension .fdx). The data stored in this file is read to load an in-memory data-structure that can be used to locate the start offset of a block containing any document in the fields data file. In order to have a compact in-memory representation, for every block of 1024 chunks, this stored fields index computes the average number of bytes per chunk and, for every chunk, only stores the difference between
- ${chunk number} * ${average length of a chunk}
- and the actual start offset of the chunk
Data is written as follows:
- FieldsIndex (.fdx) --> <Header>, PackedIntsVersion, <Block>^BlockCount, BlocksEndMarker
- Header --> CodecHeader
- PackedIntsVersion --> PackedInts.VERSION_CURRENT as a VInt
- BlocksEndMarker --> 0 as a VInt; this marks the end of blocks, since blocks are not allowed to start with 0
- Block --> BlockChunks, <DocBases>, <StartPointers>
- BlockChunks --> a VInt which is the number of chunks encoded in the block
- DocBases --> DocBase, AvgChunkDocs, BitsPerDocBaseDelta, DocBaseDeltas
- DocBase --> first document ID of the block of chunks, as a VInt
- AvgChunkDocs --> average number of documents in a single chunk, as a VInt
- BitsPerDocBaseDelta --> number of bits required to represent a delta from the average using ZigZag encoding
- DocBaseDeltas --> packed array of BlockChunks elements of BitsPerDocBaseDelta bits each, representing the deltas from the average doc base using ZigZag encoding
- StartPointers --> StartPointerBase, AvgChunkSize, BitsPerStartPointerDelta, StartPointerDeltas
- StartPointerBase --> the first start pointer of the block, as a VLong
- AvgChunkSize --> the average size of a chunk of compressed documents, as a VLong
- BitsPerStartPointerDelta --> number of bits required to represent a delta from the average using ZigZag encoding
- StartPointerDeltas --> packed array of BlockChunks elements of BitsPerStartPointerDelta bits each, representing the deltas from the average start pointer using ZigZag encoding (see the sketch after this list)
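Since the average-plus-delta scheme is easier to picture in code, here is a small, hypothetical sketch of how the StartPointerDeltas values might be computed (DocBaseDeltas work the same way, with AvgChunkDocs instead of AvgChunkSize). The names are made up, and the actual writer uses Lucene's packed-ints utilities for the final packed array.

```java
/**
 * Hypothetical sketch of the average-plus-ZigZag-delta encoding used for
 * StartPointerDeltas. It only computes the values to store; writing the VInts,
 * VLongs and the packed array works as in the previous sketch.
 */
public class BlockDeltasSketch {

  /** Maps a signed delta to an unsigned value: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ... */
  static long zigZagEncode(long v) {
    return (v >> 63) ^ (v << 1);
  }

  /**
   * Given the start pointers of the chunks of one block, returns the ZigZag-encoded
   * deltas from {@code StartPointerBase + n * AvgChunkSize}, i.e. what StartPointerDeltas stores.
   */
  static long[] computeStartPointerDeltas(long[] startPointers) {
    long base = startPointers[0];                          // StartPointerBase
    long totalSize = startPointers[startPointers.length - 1] - base;
    long avgChunkSize = startPointers.length == 1
        ? 0
        : totalSize / (startPointers.length - 1);          // AvgChunkSize (an approximation)
    long[] deltas = new long[startPointers.length];
    long maxZigZag = 0;
    for (int n = 0; n < startPointers.length; n++) {
      long expected = base + avgChunkSize * n;              // where the chunk "should" start
      deltas[n] = zigZagEncode(startPointers[n] - expected);
      maxZigZag = Math.max(maxZigZag, deltas[n]);
    }
    // BitsPerStartPointerDelta = bits needed for the largest ZigZag-encoded delta.
    int bitsPerStartPointerDelta = 64 - Long.numberOfLeadingZeros(maxZigZag | 1);
    // The deltas would then be written as a packed array of bitsPerStartPointerDelta-bit values.
    return deltas;
  }
}
```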
Notes
- For any block, the doc base of the n-th chunk can be restored with DocBase + AvgChunkDocs * n + DocBaseDeltas[n].
- For any block, the start pointer of the n-th chunk can be restored with StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n].
- Once data is loaded into memory, you can look up the start pointer of any document by performing two binary searches: a first one based on the values of DocBase in order to find the right block, and then inside the block based on DocBaseDeltas (by reconstructing the doc bases for every chunk).
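The two reconstruction formulas and the two-level lookup can be combined into a small, hypothetical in-memory sketch; the real reader keeps the deltas in packed arrays rather than plain long[] arrays, and all names below are made up.

```java
import java.util.function.IntToLongFunction;

/**
 * Hypothetical in-memory view of the .fdx data for one segment, illustrating how the
 * doc base / start pointer of a chunk are reconstructed and how a document is located
 * with two binary searches.
 */
public class FieldsIndexSketch {

  // One entry per block of chunks.
  long[] docBases;             // DocBase of each block
  long[] avgChunkDocs;         // AvgChunkDocs of each block
  long[][] docBaseDeltas;      // decoded (un-ZigZagged) DocBaseDeltas of each block
  long[] startPointerBases;    // StartPointerBase of each block
  long[] avgChunkSizes;        // AvgChunkSize of each block
  long[][] startPointerDeltas; // decoded StartPointerDeltas of each block

  /** DocBase + AvgChunkDocs * n + DocBaseDeltas[n] */
  long chunkDocBase(int block, int n) {
    return docBases[block] + avgChunkDocs[block] * n + docBaseDeltas[block][n];
  }

  /** StartPointerBase + AvgChunkSize * n + StartPointerDeltas[n] */
  long chunkStartPointer(int block, int n) {
    return startPointerBases[block] + avgChunkSizes[block] * n + startPointerDeltas[block][n];
  }

  /** Finds the start offset in the .fdt file of the chunk containing docID. */
  long startPointerForDoc(int docID) {
    // First binary search: find the last block whose DocBase is <= docID.
    int block = lastIndexAtMost(docBases.length, b -> docBases[b], docID);
    // Second binary search: inside the block, find the last chunk whose doc base is <= docID.
    int chunk = lastIndexAtMost(docBaseDeltas[block].length, n -> chunkDocBase(block, n), docID);
    return chunkStartPointer(block, chunk);
  }

  /** Returns the largest index i in [0, size) with f(i) <= target (assumes f is non-decreasing). */
  private static int lastIndexAtMost(int size, IntToLongFunction f, long target) {
    int lo = 0, hi = size - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (f.applyAsLong(mid) <= target) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }
}
```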
Known limitations
This StoredFieldsFormat does not support individual documents larger than (2^31 - 2^14) bytes. In case this is a problem, you should use another format, such as Lucene40StoredFieldsFormat.