Lucene40PostingsFormat (Lucene 4.0.0 API)

java.lang.Object
- org.apache.lucene.codecs.PostingsFormat
- - org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat

All Implemented Interfaces:

NamedSPILoader.NamedSPI
```
public final class Lucene40PostingsFormat
extends PostingsFormat
```
Lucene 4.0 Postings format.
Files:
- .tim: Term Dictionary
- .tip: Term Index
- .frq: Frequencies
- .prx: Positions
Term Dictionary

The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and pointers to the frequencies, positions and skip data in the .frq and .prx files.

The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block. It is written by BlockTreeTermsWriter and read by BlockTreeTermsReader.

NOTE: The term dictionary can plug into different postings implementations: for example the postings writer/reader are actually responsible for encoding and decoding the MetadataBlock.
- TermsDict (.tim) --> Header, DirOffset, PostingsHeader, SkipInterval, MaxSkipLevels, SkipMinimum, Block^NumBlocks, FieldSummary
- Block --> SuffixBlock, StatsBlock, MetadataBlock
- SuffixBlock --> EntryCount, SuffixLength, Byte^SuffixLength
- StatsBlock --> StatsLength, <DocFreq, TotalTermFreq>^EntryCount
- MetadataBlock --> MetaLength, <FreqDelta, SkipDelta?, ProxDelta?>^EntryCount
- FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^{RootCodeLength}, SumDocFreq, DocCount>^NumFields
- Header,PostingsHeader --> CodecHeader
- DirOffset --> Uint64
- SkipInterval,MaxSkipLevels,SkipMinimum --> Uint32
- EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,SkipDelta,NumFields, FieldNumber,RootCodeLength,DocCount --> VInt
- TotalTermFreq,FreqDelta,ProxDelta,NumTerms,SumTotalTermFreq,SumDocFreq --> VLong
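Most of the scalar values above are stored as VInt or VLong. As a rough illustration of the variable-length idea, the sketch below writes an int using seven payload bits per byte, low-order bits first, with the high bit marking that another byte follows. The class name `VIntSketch` is hypothetical, and the code is a standalone approximation rather than the library's own writer.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, for illustration only. */
final class VIntSketch {

    /** Encodes value with 7 data bits per byte; a set high bit means "more bytes follow". */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(15).length);  // small values fit in one byte
        System.out.println(writeVInt(300).length); // 300 needs two bytes
    }
}
```

Small values, which dominate delta-coded data, therefore occupy a single byte.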
Notes:
- Header is a CodecHeader storing the version information for the BlockTree implementation. On the other hand, PostingsHeader stores the version information for the postings reader/writer.
- DirOffset is a pointer to the FieldSummary section.
- SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate DocIdSetIterator.advance(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases.
- MaxSkipLevels is the max. number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration, a larger value results in slightly larger indexes but greater acceleration. See format of .frq file for more information about skip levels.
- SkipMinimum is the minimum document frequency a term must have in order to write any skip data at all.
- DocFreq is the count of documents which contain the term.
- TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
- FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the block).
- ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the block). For fields that omit position data, this will be 0 since prox information is not stored.
- SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data. SkipDelta is only stored if DocFreq is not smaller than SkipMinimum.
- FieldNumber is the field's number from FieldInfos. (.fnm)
- NumTerms is the number of unique terms for the field.
- RootCode points to the root block for the field.
- SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
- DocCount is the number of documents that have at least one posting for this field.
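The delta-coded values in the notes above (TotalTermFreq stored relative to DocFreq, and FreqDelta stored relative to the previous term) can be resolved as in the following sketch. The class and method names are hypothetical, and the code assumes that the first term in a block is measured from zero, which is one reading of the note on FreqDelta.

```java
import java.util.Arrays;

/** Hypothetical helper, for illustration only. */
final class TermBlockDeltas {

    /** TotalTermFreq is stored as (TotalTermFreq - DocFreq); add DocFreq back. */
    static long totalTermFreq(int docFreq, long storedDelta) {
        return docFreq + storedDelta;
    }

    /**
     * Turns FreqDelta values into absolute .frq file pointers.
     * Assumption: the first term in a block is measured from zero.
     */
    static long[] absolutePointers(long[] freqDeltas) {
        long[] pointers = new long[freqDeltas.length];
        long previous = 0L;
        for (int i = 0; i < freqDeltas.length; i++) {
            previous += freqDeltas[i];
            pointers[i] = previous;
        }
        return pointers;
    }

    public static void main(String[] args) {
        // A term in 3 documents with 7 occurrences overall is stored as delta 4.
        System.out.println(totalTermFreq(3, 4L)); // prints 7
        // prints [120, 138, 180]
        System.out.println(Arrays.toString(absolutePointers(new long[] {120L, 18L, 42L})));
    }
}
```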
Term Index

The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.
- TermsIndex (.tip) --> Header, <IndexStartFP>^NumFields, FSTIndex^NumFields
- Header --> CodecHeader
- IndexStartFP --> VLong
- FSTIndex --> FST<byte[]>
Notes:
- The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
- It's possible that an on-disk block would contain too many terms (more than the allowed maximum (default: 48)). When this happens, the block is sub-divided into new blocks (called "floor blocks"), and then the output in the FST for the block's prefix encodes the leading byte of each sub-block, and its file pointer.
Frequencies

The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (except when frequencies are omitted: FieldInfo.IndexOptions.DOCS_ONLY).
- FreqFile (.frq) --> Header, <TermFreqs, SkipData?> ^TermCount
- Header --> CodecHeader
- TermFreqs --> <TermFreq> ^DocFreq
- TermFreq --> DocDelta[, Freq?]
- SkipData --> <<SkipLevelLength, SkipLevel> ^{NumSkipLevels-1}, SkipLevel> <SkipDatum>
- SkipLevel --> <SkipDatum> ^{DocFreq/(SkipInterval^(Level + 1))}
- SkipDatum --> DocSkip,PayloadLength?,OffsetLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
- DocDelta,Freq,DocSkip,PayloadLength,OffsetLength,FreqSkip,ProxSkip --> VInt
- SkipChildLevelPointer --> VLong
TermFreqs are ordered by term (the term is implicit, from the term dictionary).

TermFreq entries are ordered by increasing document number.

DocDelta: if frequencies are indexed, this determines both the document number and the frequency. In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If frequencies are omitted, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored.

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with frequencies indexed, would be the following sequence of VInts:

15, 8, 3

If frequencies were omitted (FieldInfo.IndexOptions.DOCS_ONLY) it would be this sequence of VInts instead:

7,4
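The DocDelta rule can be sketched as follows. The class `TermFreqsEncoder` is hypothetical; it produces the logical integer values only and leaves their VInt byte serialization out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class TermFreqsEncoder {

    /**
     * Produces the integer sequence for one term's TermFreqs.
     * docIds must be increasing; freqs is ignored when withFreqs is false.
     */
    static List<Integer> encode(int[] docIds, int[] freqs, boolean withFreqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0; // the first document is measured from zero
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (!withFreqs) {
                out.add(gap);            // plain gap, not multiplied by 2
            } else if (freqs[i] == 1) {
                out.add((gap << 1) | 1); // odd value: frequency is one
            } else {
                out.add(gap << 1);       // even value: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(encode(docs, freqs, true));  // prints [15, 8, 3]
        System.out.println(encode(docs, freqs, false)); // prints [7, 4]
    }
}
```

Folding the common case of frequency one into the low bit of DocDelta saves a value per posting for terms that appear once in a document.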

DocSkip records the document number before every SkipInterval^th document in TermFreqs. If payloads and offsets are disabled for the term's field, then DocSkip represents the difference from the previous value in the sequence. If payloads and/or offsets are enabled for the term's field, then DocSkip/2 represents the difference from the previous value in the sequence. In this case when DocSkip is odd, then PayloadLength and/or OffsetLength are stored indicating the length of the last payload/offset before the SkipInterval^th document in TermPositions.

PayloadLength indicates the length of the last payload.

OffsetLength indicates the length of the last offset (endOffset-startOffset).

FreqSkip and ProxSkip record the position of every SkipInterval^th entry in FreqFile and ProxFile, respectively. File positions are relative to the start of TermFreqs and Positions for the first SkipDatum, and to the previous SkipDatum in the sequence thereafter.

For example, if DocFreq=35 and SkipInterval=16, then there are two SkipData entries, containing the 15^th and 31^st document numbers in TermFreqs. The first FreqSkip names the number of bytes after the beginning of TermFreqs that the 16^th SkipDatum starts, and the second the number of bytes after that that the 32^nd starts. The first ProxSkip names the number of bytes after the beginning of Positions that the 16^th SkipDatum starts, and the second the number of bytes after that that the 32^nd starts.

Each term can have multiple skip levels. The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip level is Level=0.
Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3^rd, 7^th, 11^th, 15^th, 19^th, 23^rd, 27^th, and 31^st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15^th and 31^st document numbers in TermFreqs.
The SkipData entries on all upper levels > 0 contain a SkipChildLevelPointer referencing the corresponding SkipData entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0.
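The skip-level arithmetic can be checked with a small sketch. It reads the formula as floor(log(DocFreq)/log(SkipInterval)) and computes it with integer division to avoid floating-point rounding. The class `SkipLevels` and its methods are hypothetical names, not part of the Lucene API.

```java
/** Hypothetical helper, for illustration only. */
final class SkipLevels {

    /** Min(maxSkipLevels, floor(log(docFreq)/log(skipInterval))), computed with integers. */
    static int numSkipLevels(int docFreq, int skipInterval, int maxSkipLevels) {
        if (skipInterval < 2) {
            throw new IllegalArgumentException("skipInterval must be at least 2");
        }
        int levels = 0;
        // Each division that still leaves at least skipInterval adds one level.
        for (long n = docFreq; n >= skipInterval; n /= skipInterval) {
            levels++;
        }
        return Math.min(maxSkipLevels, levels);
    }

    /** Number of entries on a level: docFreq / skipInterval^(level + 1), rounded down. */
    static int entriesOnLevel(int docFreq, int skipInterval, int level) {
        long divisor = 1L;
        for (int i = 0; i <= level; i++) {
            divisor *= skipInterval;
        }
        return (int) (docFreq / divisor);
    }

    public static void main(String[] args) {
        // SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35
        System.out.println(numSkipLevels(35, 4, 2)); // prints 2
        System.out.println(entriesOnLevel(35, 4, 0)); // prints 8
        System.out.println(entriesOnLevel(35, 4, 1)); // prints 2
    }
}
```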

Positions

The .prx file contains the lists of positions that each term occurs at within documents. Note that fields omitting positional data do not store anything into this file, and if all fields in the index omit positional data then the .prx file will not exist.
- ProxFile (.prx) --> Header, <TermPositions> ^TermCount
- Header --> CodecHeader
- TermPositions --> <Positions> ^DocFreq
- Positions --> <PositionDelta,PayloadLength?,OffsetDelta?,OffsetLength?,PayloadData?> ^Freq
- PositionDelta,OffsetDelta,OffsetLength,PayloadLength --> VInt
- PayloadData --> byte^{PayloadLength}
TermPositions are ordered by term (the term is implicit, from the term dictionary).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position.

For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of VInts (payloads disabled):

4, 5, 4
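The sequence above can be reproduced with the sketch below, which covers only the case where payloads and offsets are disabled. The class `PositionDeltas` is a hypothetical name, and the output is the list of logical values before VInt serialization.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class PositionDeltas {

    /**
     * Encodes positions for one term; positionsPerDoc holds one increasing
     * array of positions per document, in document order.
     */
    static List<Integer> encode(int[][] positionsPerDoc) {
        List<Integer> out = new ArrayList<>();
        for (int[] positions : positionsPerDoc) {
            int previous = 0; // deltas restart from zero in every document
            for (int position : positions) {
                out.add(position - previous);
                previous = position;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Fourth term in one document, fifth and ninth term in the next.
        System.out.println(encode(new int[][] {{4}, {5, 9}})); // prints [4, 5, 4]
    }
}
```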

PayloadData is metadata associated with the current term position. If PayloadLength is stored at the current position, then it indicates the length of this payload. If PayloadLength is not stored, then this payload has the same length as the payload at the previous position.

OffsetDelta/2 is the difference between this position's startOffset and that of the previous occurrence (or zero, if this is the first occurrence in this document). If OffsetDelta is odd, then the length (endOffset-startOffset) differs from the previous occurrence and an OffsetLength follows. Offset data is only written for FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.
WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary
- Fields inherited from class org.apache.lucene.codecs.PostingsFormat
  EMPTY

Constructor Summary

Constructors
Constructor and Description
`Lucene40PostingsFormat()` Creates `Lucene40PostingsFormat` with default settings.
`Lucene40PostingsFormat(int minBlockSize, int maxBlockSize)` Creates `Lucene40PostingsFormat` with custom values for `minBlockSize` and `maxBlockSize` passed to block terms dictionary.

Method Summary

Methods
Modifier and Type	Method and Description
`FieldsConsumer`	`fieldsConsumer(SegmentWriteState state)` Writes a new segment
`FieldsProducer`	`fieldsProducer(SegmentReadState state)` Reads a segment.
`String`	`toString()`

Methods inherited from class org.apache.lucene.codecs.PostingsFormat
availablePostingsFormats, forName, getName, reloadPostingsFormats

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Constructor Detail
  - Lucene40PostingsFormat
```
public Lucene40PostingsFormat()
```
    Creates Lucene40PostingsFormat with default settings.
  - Lucene40PostingsFormat
```
public Lucene40PostingsFormat(int minBlockSize,
                      int maxBlockSize)
```
    Creates Lucene40PostingsFormat with custom values for minBlockSize and maxBlockSize passed to block terms dictionary.
    
    See Also:
    BlockTreeTermsWriter.BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int)
- Method Detail
  - fieldsConsumer
```
public FieldsConsumer fieldsConsumer(SegmentWriteState state)
                              throws IOException
```
    Description copied from class: PostingsFormat
    
    Writes a new segment
    
    Specified by:
    
    fieldsConsumer in class PostingsFormat
    
    Throws:
    
    IOException
  - fieldsProducer
```
public FieldsProducer fieldsProducer(SegmentReadState state)
                              throws IOException
```
    Description copied from class: PostingsFormat
    
    Reads a segment. NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.
    
    Specified by:
    
    fieldsProducer in class PostingsFormat
    
    Throws:
    
    IOException
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class PostingsFormat
