public final class Lucene84PostingsFormat extends PostingsFormat
Basic idea:
In packed blocks, integers are encoded with the same bit width (packed format): the block size (i.e. the number of integers inside a block) is fixed (currently 128). Additionally, blocks that are all the same value are encoded in an optimized way.
In VInt blocks, integers are encoded as VInt: the block size is variable.
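As a rough illustration of "the same bit width", the sketch below computes the width a packed block would need and detects the all-equal case. It assumes the width is the number of bits required by the largest value in the block and that values are non-negative; the text above does not spell this out, and `PackedWidthSketch` and its methods are hypothetical names, not Lucene APIs.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class PackedWidthSketch {
    // Bits needed to represent every (non-negative) value in the block.
    // OR-ing the values yields a number with the same highest set bit as the maximum.
    static int bitsRequired(long[] block) {
        long or = 0;
        for (long v : block) {
            or |= v;
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(or));
    }

    // True when every value in the block is identical, the case the
    // format is said to encode in an optimized way.
    static boolean allEqual(long[] block) {
        for (long v : block) {
            if (v != block[0]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] block = {1, 5, 127};
        System.out.println("bit width: " + bitsRequired(block));
        System.out.println("all equal: " + allEqual(block));
    }
}
```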
When the postings are long enough, Lucene84PostingsFormat will try to encode most integer data as a packed block.
Take a term with 259 documents as an example: the first 256 document ids are encoded as two packed blocks, while the remaining 3 are encoded as one VInt block.
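The arithmetic behind this example can be sketched as follows. The class and method names are hypothetical, and the block size of 128 is taken from the description above.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class BlockSplit {
    // Packed block size stated in the text (currently 128).
    static final int BLOCK_SIZE = 128;

    // Number of full packed blocks for a postings list of docFreq entries.
    static int packedBlocks(int docFreq) {
        return docFreq / BLOCK_SIZE;
    }

    // Number of entries left over for the trailing VInt block (0 means no VInt block).
    static int vIntTail(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        int docFreq = 259;
        // prints "2 packed blocks, 3 entries in the VInt block"
        System.out.println(packedBlocks(docFreq) + " packed blocks, "
            + vIntTail(docFreq) + " entries in the VInt block");
    }
}
```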
Different kinds of data are always encoded separately into different packed blocks, but may be interleaved into the same VInt block.
This strategy is applied to the tuples: <document number, frequency>, <position, payload length>, <position, offset start, offset length>, and <position, payload length, offset start, offset length>.
The structure of the skip table is quite similar to that of previous versions of Lucene. The skip interval is the same as the block size, and each skip entry points to the beginning of a block. However, skip data is omitted for the first block.
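Read literally, one skip entry per block start except the first block gives the count sketched below. This is derived only from the description above and covers the lowest skip level; `SkipEntryCount` is a hypothetical name, not a Lucene class.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class SkipEntryCount {
    // Skip interval equals the packed block size (currently 128).
    static final int BLOCK_SIZE = 128;

    // Lowest-level skip entries: one per block start, excluding the first block.
    static int skipEntries(int docFreq) {
        if (docFreq <= 0) {
            return 0;
        }
        // Total blocks: full packed blocks plus a trailing VInt block, if any.
        int blocks = (docFreq + BLOCK_SIZE - 1) / BLOCK_SIZE;
        return blocks - 1;
    }

    public static void main(String[] args) {
        for (int docFreq : new int[] {128, 129, 256, 259}) {
            System.out.println(docFreq + " docs -> " + skipEntries(docFreq) + " skip entries");
        }
    }
}
```

Under this reading, a term with at most 128 documents has no skip entries at all.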
A position is an integer indicating where the term occurs within one document. A payload is a blob of metadata associated with the current position. An offset is a pair of integers indicating the tokenized start/end offsets for the given term at the current position: it is essentially a specialized payload.
When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a null payload contributes one count). As mentioned under block structure, it is possible to encode these three either combined or separately.
In all cases, payloads and offsets are stored together. When encoded as a packed block, position data is separated out as .pos, while payloads and offsets are encoded in .pay (payload metadata will also be stored directly in .pay). When encoded as VInt blocks, all three are stored interleaved in the .pos file (as is payload metadata).
With this strategy, the majority of payload and offset data will be outside the .pos file. So for queries that require only position data, running on a full index with payloads and offsets, this reduces disk pre-fetches.
Files and detailed format:
The .tim file contains the list of terms in each
field along with per-term statistics (such as docfreq)
and pointers to the frequencies, positions, payload and
skip data in the .doc, .pos, and .pay files.
See BlockTreeTermsWriter
for more details on the format.
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the PostingsHeader and TermMetadata sections described here:
IndexHeader
VInt
VLong
CodecFooter
Notes:
Header is an IndexHeader storing the version information for the postings.
PackedBlockSize is also the skip interval used to accelerate DocIdSetIterator.advance(int).
The .tip file contains an index into the term dictionary, so that it can be
accessed randomly. See BlockTreeTermsWriter
for more details on the format.
The .doc file contains the lists of documents which contain each term, along
with the frequency of the term in that document (except when frequencies are
omitted: IndexOptions.DOCS). It also saves skip data pointing to the beginning of
each packed or VInt block, when the length of the document list is larger than the packed block size.
IndexHeader
PackedInts
VInt
VLong
CodecFooter
Notes:
DocDelta: if frequencies are indexed, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (the previous document number is taken as zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If frequencies are omitted, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with frequencies indexed, would be the following sequence of VInts:
15, 8, 3
If frequencies were omitted (IndexOptions.DOCS) it would be this sequence of VInts instead:
7, 4
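The DocDelta rule can be sketched in a few lines of Java. This is a minimal illustration that produces the logical VInt values only (not their byte encoding); `DocDeltaSketch` and `encode` are hypothetical names, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class DocDeltaSketch {
    // Returns the sequence of VInt values for one TermFreqs list.
    // docIds must be strictly increasing; freqs is ignored when indexFreqs is false.
    static List<Integer> encode(int[] docIds, int[] freqs, boolean indexFreqs) {
        List<Integer> out = new ArrayList<>();
        int prevDoc = 0; // the previous document number is zero for the first document
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - prevDoc;
            prevDoc = docIds[i];
            if (!indexFreqs) {
                out.add(gap);            // plain gap, not multiplied by 2
            } else if (freqs[i] == 1) {
                out.add((gap << 1) | 1); // odd DocDelta: frequency is one
            } else {
                out.add(gap << 1);       // even DocDelta: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(encode(docs, freqs, true));  // prints [15, 8, 3]
        System.out.println(encode(docs, freqs, false)); // prints [7, 4]
    }
}
```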
In MultiLevelSkipListWriter, skip data is assumed to be saved for the
skipInterval-th, 2*skipInterval-th ... posting in the list. However,
in Lucene84PostingsFormat, the skip data is saved for the (skipInterval+1)-th,
(2*skipInterval+1)-th ... posting (skipInterval==PackedBlockSize in this case).
When DocFreq is a multiple of PackedBlockSize, MultiLevelSkipListWriter will expect one
more skip data entry than Lucene84SkipWriter.
The .pos file contains the lists of positions that each term occurs at within documents. It also sometimes stores part of payloads and offsets for speedup.
IndexHeader
PackedInts
VInt
byte^PayLength
CodecFooter
Notes:
4, 5, 4
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.
The .pay file will store payloads and offsets associated with certain term-document positions. Some payloads and offsets will be separated out into the .pos file, for performance reasons.
IndexHeader
PackedInts
VInt
byte^SumPayLength
CodecFooter
Notes:
Modifier and Type | Class and Description |
---|---|
static class | Lucene84PostingsFormat.IntBlockTermState: Holds all state required for Lucene84PostingsReader to produce a PostingsEnum without re-seeking the terms dict. |
Modifier and Type | Field and Description |
---|---|
static int | BLOCK_SIZE: Size of blocks. |
static String | DOC_EXTENSION: Filename extension for document number, frequencies, and skip data. |
static String | PAY_EXTENSION: Filename extension for payloads and offsets. |
static String | POS_EXTENSION: Filename extension for positions. |
Fields inherited from class PostingsFormat: EMPTY
Constructor and Description |
---|
Lucene84PostingsFormat(): Creates Lucene84PostingsFormat with default settings. |
Lucene84PostingsFormat(int minTermBlockSize, int maxTermBlockSize): Creates Lucene84PostingsFormat with custom values for minBlockSize and maxBlockSize passed to block terms dictionary. |
Modifier and Type | Method and Description |
---|---|
FieldsConsumer | fieldsConsumer(SegmentWriteState state): Writes a new segment. |
FieldsProducer | fieldsProducer(SegmentReadState state): Reads a segment. |
String | toString() |
Methods inherited from class PostingsFormat: availablePostingsFormats, forName, getName, reloadPostingsFormats
public static final String DOC_EXTENSION
public static final String POS_EXTENSION
public static final String PAY_EXTENSION
public static final int BLOCK_SIZE
public Lucene84PostingsFormat()
Creates Lucene84PostingsFormat with default settings.
public Lucene84PostingsFormat(int minTermBlockSize, int maxTermBlockSize)
Creates Lucene84PostingsFormat with custom values for minBlockSize and maxBlockSize passed to block terms dictionary.
public String toString()
Overrides: toString in class PostingsFormat
public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException
Writes a new segment.
Specified by: fieldsConsumer in class PostingsFormat
Throws: IOException
public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException
Reads a segment.
Specified by: fieldsProducer in class PostingsFormat
Throws: IOException
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.