public final class Lucene50PostingsFormat extends PostingsFormat
Basic idea:
In packed blocks, integers are encoded with the same bit width (packed format):
      the block size (i.e. number of integers inside block) is fixed (currently 128). Additionally blocks
      that are all the same value are encoded in an optimized way.
In VInt blocks, integers are encoded as VInt:
      the block size is variable.
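The two encodings above can be illustrated with a minimal sketch. It assumes non-negative values and the usual VInt convention (seven data bits per byte, with the high bit set when more bytes follow); the class and method names are hypothetical and are not part of Lucene50PostingsFormat.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not the Lucene implementation.
final class IntEncodingSketch {

    /** Bit width that lets every (non-negative) value of a packed block share one width. */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v; // the OR of all values has the same highest set bit as the maximum
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** VInt-style encoding: 7 data bits per byte, high bit marks a continuation. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value); // last byte, continuation flag clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(bitsRequired(new int[] {3, 17, 100})); // 7
        System.out.println(writeVInt(127).length);                // 1
        System.out.println(writeVInt(128).length);                // 2
    }
}
```

In a packed block every integer costs the same number of bits, so one large value widens the whole block; a VInt block pays per value instead.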
When the postings are long enough, Lucene50PostingsFormat will try to encode most integer data as a packed block.
Take a term with 259 documents as an example: the first 256 document IDs are encoded as two packed blocks, while the remaining 3 are encoded as one VInt block.
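The arithmetic behind this example can be sketched as follows. The class and method names are hypothetical; only the block size of 128 comes from the description above.

```java
// Hypothetical illustration of how a postings list is split into blocks.
final class BlockSplitSketch {
    static final int BLOCK_SIZE = 128; // fixed packed block size stated above

    /** Number of full packed blocks for a term with docFreq documents. */
    static int packedBlocks(int docFreq) {
        return docFreq / BLOCK_SIZE;
    }

    /** Number of trailing entries left over for the single VInt block. */
    static int vIntTail(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(packedBlocks(259)); // 2
        System.out.println(vIntTail(259));     // 3
    }
}
```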
Different kinds of data are always encoded separately into different packed blocks, but may possibly be interleaved into the same VInt block.
This strategy is applied to the following tuples: <document number, frequency>, <position, payload length>, <position, offset start, offset length>, and <position, payload length, offset start, offset length>.
The structure of the skip table is quite similar to that of previous versions of Lucene. The skip interval is the same as the block size, and each skip entry points to the beginning of a block. However, skip data is omitted for the first block.
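As a rough sketch of that rule, the following lists the postings that receive a skip entry, assuming one entry per block start and none for the first block. The names are hypothetical, and the code models only the description given here, not the actual skip writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of where skip entries point; not the Lucene skip writer.
final class SkipEntrySketch {
    static final int BLOCK_SIZE = 128; // skip interval equals the packed block size

    /** 0-based index of the first posting of every block except the first. */
    static List<Integer> skipTargets(int docFreq) {
        List<Integer> targets = new ArrayList<>();
        for (int start = BLOCK_SIZE; start < docFreq; start += BLOCK_SIZE) {
            targets.add(start);
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(skipTargets(259)); // [128, 256]
        System.out.println(skipTargets(256)); // [128]
    }
}
```

With 256 documents the second packed block is the last one, so only one skip entry exists even though the list holds two full blocks.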
A position is an integer indicating where the term occurs within one document. A payload is a blob of metadata associated with the current position. An offset is a pair of integers indicating the tokenized start/end offsets for the given term at the current position: it is essentially a specialized payload.
When payloads and offsets are not omitted, numPositions==numPayloads==numOffsets (assuming a null payload contributes one count). As mentioned in block structure, it is possible to encode these three either combined or separately.
In all cases, payloads and offsets are stored together. When encoded as a packed block, position data is separated out into .pos, while payloads and offsets are encoded in .pay (payload metadata is also stored directly in .pay). When encoded as VInt blocks, all three are stored interleaved in .pos (as is payload metadata).
With this strategy, the majority of payload and offset data is kept outside the .pos file. So for queries that require only position data, running on a full index with payloads and offsets, this reduces disk pre-fetches.
Files and detailed format:
The .tim file contains the list of terms in each
 field along with per-term statistics (such as docfreq)
 and pointers to the frequencies, positions, payload and
 skip data in the .doc, .pos, and .pay files.
 See BlockTreeTermsWriter for more details on the format.
 
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the PostingsHeader and TermMetadata sections described here:
IndexHeader, VInt, VLong, CodecFooter
Notes:
IndexHeader storing the version information for the postings.
DocIdSetIterator.advance(int).
    The .tip file contains an index into the term dictionary, so that it can be 
 accessed randomly.  See BlockTreeTermsWriter for more details on the format.
 
The .doc file contains the lists of documents which contain each term, along
 with the frequency of the term in that document (except when frequencies are
 omitted: IndexOptions.DOCS). It also saves skip data to the beginning of
 each packed or VInt block, when the length of the document list is larger than the packed block size.
IndexHeader, PackedInts, VInt, VLong, CodecFooter
Notes:
DocDelta: if frequencies are indexed, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If frequencies are omitted, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.
For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with frequencies indexed, would be the following sequence of VInts:
15, 8, 3
If frequencies were omitted (IndexOptions.DOCS) it would be this
          sequence of VInts instead:
7,4
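The DocDelta rule and both example sequences can be reproduced with a short sketch. The class and method names are hypothetical, and the integers are collected in a list rather than written as VInt bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the DocDelta rule; not the Lucene writer.
final class DocDeltaSketch {

    /** Produces the integer sequence that would be written as VInts for one term. */
    static List<Integer> encode(int[] docIds, int[] freqs, boolean withFreqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0; // the first document is delta-coded against zero
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (!withFreqs) {
                out.add(gap);            // frequencies omitted: plain gap
            } else if (freqs[i] == 1) {
                out.add((gap << 1) | 1); // odd DocDelta: frequency is one
            } else {
                out.add(gap << 1);       // even DocDelta: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(encode(docs, freqs, true));  // [15, 8, 3]
        System.out.println(encode(docs, freqs, false)); // [7, 4]
    }
}
```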
In MultiLevelSkipListWriter, skip data is assumed to be saved for the skipInterval-th, 2*skipInterval-th ... posting in the list. However, in Lucene50PostingsFormat, the skip data is saved for the (skipInterval+1)-th, (2*skipInterval+1)-th ... posting (skipInterval==PackedBlockSize in this case). When DocFreq is a multiple of PackedBlockSize, MultiLevelSkipListWriter will expect one more skip data entry than Lucene50SkipWriter.
The .pos file contains the lists of positions that each term occurs at within documents. It also sometimes stores part of payloads and offsets for speedup.
IndexHeader, PackedInts, VInt, byte^PayLength, CodecFooter
Notes:
4, 5, 4
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.
The .pay file will store payloads and offsets associated with certain term-document positions. Some payloads and offsets will be separated out into the .pos file, for performance reasons.
IndexHeader, PackedInts, VInt, byte^SumPayLength, CodecFooter
Notes:
| Modifier and Type | Field and Description |
|---|---|
| static int | BLOCK_SIZE: Fixed packed block size, number of integers encoded in a single packed block. |
| static String | DOC_EXTENSION: Filename extension for document number, frequencies, and skip data. |
| static String | PAY_EXTENSION: Filename extension for payloads and offsets. |
| static String | POS_EXTENSION: Filename extension for positions. |

EMPTY

| Constructor and Description |
|---|
| Lucene50PostingsFormat(): Creates Lucene50PostingsFormat with default settings. |
| Lucene50PostingsFormat(int minTermBlockSize, int maxTermBlockSize): Creates Lucene50PostingsFormat with custom values for minBlockSize and maxBlockSize passed to block terms dictionary. |

| Modifier and Type | Method and Description |
|---|---|
| FieldsConsumer | fieldsConsumer(SegmentWriteState state): Writes a new segment. |
| FieldsProducer | fieldsProducer(SegmentReadState state): Reads a segment. |
| String | toString() |

availablePostingsFormats, forName, getName, reloadPostingsFormats
public static final String DOC_EXTENSION
public static final String POS_EXTENSION
public static final String PAY_EXTENSION
public static final int BLOCK_SIZE
public Lucene50PostingsFormat()
Creates Lucene50PostingsFormat with default settings.
public Lucene50PostingsFormat(int minTermBlockSize,
                              int maxTermBlockSize)
Creates Lucene50PostingsFormat with custom values for minBlockSize and maxBlockSize passed to block terms dictionary.
public String toString()
Overrides: toString in class PostingsFormat
public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException
Description copied from class: PostingsFormat
Specified by: fieldsConsumer in class PostingsFormat
Throws: IOException
public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException
Description copied from class: PostingsFormat
Specified by: fieldsProducer in class PostingsFormat
Throws: IOException
Copyright © 2000-2018 Apache Software Foundation. All Rights Reserved.