org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter

All Implemented Interfaces: Closeable, AutoCloseable

public final class Lucene90BlockTreeTermsWriter extends FieldsConsumer

Block-based terms index and dictionary writer.

Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.

Files:

.tim: Term Dictionary
.tmd: Term Metadata
.tip: Term Index

Term Dictionary

The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

NOTE: The term dictionary can plug into different postings implementations: the postings writer and reader are responsible for encoding and decoding the PostingsHeader and TermMetadata sections.

TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, Footer
NodeBlock --> (OuterNode | InnerNode)
OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
TermStats --> DocFreq, TotalTermFreq
Header --> CodecHeader
EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength --> VInt
TotalTermFreq --> VLong
Footer --> CodecFooter

Notes:

Header is a CodecHeader storing the version information for the BlockTree implementation.
DocFreq is the count of documents which contain the term.
TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
PostingsHeader and TermMetadata are supplied by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
For inner nodes of the tree, every entry steals one bit to mark whether it points to a child node (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
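
The TermStats layout described above (DocFreq as a VInt, then TotalTermFreq stored as its difference from DocFreq in a VLong) can be sketched without any Lucene dependency. The class name TermStatsSketch and its helper methods are hypothetical, and the sketch follows the grammar as documented here; the real writer may apply further optimizations, so the bytes it emits are not guaranteed to match.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical, stand-alone sketch of the documented TermStats layout. */
final class TermStatsSketch {

    /** Variable-length encoding: 7 data bits per byte, low-order group first,
     *  high bit set on every byte except the last. */
    static void writeVLong(ByteArrayOutputStream out, long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be >= 0; got " + value);
        }
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Reads one variable-length value starting at pos[0] and advances pos[0]. */
    static long readVLong(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /** TermStats --> DocFreq, TotalTermFreq (stored as TotalTermFreq - DocFreq). */
    static byte[] encode(int docFreq, long totalTermFreq) {
        if (docFreq <= 0 || totalTermFreq < docFreq) {
            throw new IllegalArgumentException("need 0 < docFreq <= totalTermFreq");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, docFreq); // a non-negative VInt has the same byte layout
        writeVLong(out, totalTermFreq - docFreq);
        return out.toByteArray();
    }

    /** Returns {docFreq, totalTermFreq}. */
    static long[] decode(byte[] bytes) {
        int[] pos = {0};
        long docFreq = readVLong(bytes, pos);
        long totalTermFreq = docFreq + readVLong(bytes, pos);
        return new long[] {docFreq, totalTermFreq};
    }
}
```

Storing the difference keeps the second value small: a term that occurs exactly once in each of its documents encodes a delta of 0, which takes a single byte.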

Term Metadata

The .tmd file contains the list of term metadata (such as FST index metadata) and field level statistics (such as sum of total term freq).

TermsMeta (.tmd) --> Header, NumFields, <FieldStats>^NumFields, TermIndexLength, TermDictLength, Footer
FieldStats --> FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount, MinTerm, MaxTerm, IndexStartFP, FSTHeader, FSTMetadata
Header,FSTHeader --> CodecHeader
TermIndexLength, TermDictLength --> Uint64
MinTerm,MaxTerm --> VInt length followed by the byte[]
NumFields,FieldNumber,RootCodeLength,DocCount --> VInt
NumTerms,SumTotalTermFreq,SumDocFreq,IndexStartFP --> VLong
Footer --> CodecFooter

Notes:

FieldNumber is the field's number from FieldInfos (.fnm).
NumTerms is the number of unique terms for the field.
RootCode points to the root block for the field.
SumDocFreq is the total number of postings, that is, the number of term-document pairs across the entire field.
DocCount is the number of documents that have at least one posting for this field.
MinTerm, MaxTerm are the lowest and highest term in this field.
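
To make the field-level statistics concrete, the following sketch computes them from a toy in-memory postings map (term to docID to within-document frequency). FieldStatsSketch and its input shape are hypothetical and only illustrate the definitions in the notes above; Lucene itself orders terms by their bytes, which agrees with Java String order only for simple cases such as ASCII terms.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: derives the documented field statistics from toy postings. */
final class FieldStatsSketch {
    long numTerms;          // unique terms with at least one posting
    long sumDocFreq;        // term-document pairs across the field
    long sumTotalTermFreq;  // term occurrences across the field
    int docCount;           // documents with at least one posting
    String minTerm;         // lowest term (String order; an assumption, see lead-in)
    String maxTerm;         // highest term

    /** postings: term -> (docID -> frequency of the term in that document). */
    static FieldStatsSketch compute(TreeMap<String, Map<Integer, Integer>> postings) {
        FieldStatsSketch stats = new FieldStatsSketch();
        BitSet docsSeen = new BitSet();
        for (Map.Entry<String, Map<Integer, Integer>> term : postings.entrySet()) {
            if (term.getValue().isEmpty()) {
                continue; // a term without documents contributes nothing
            }
            stats.numTerms++;
            if (stats.minTerm == null) {
                stats.minTerm = term.getKey(); // TreeMap iterates in sorted order
            }
            stats.maxTerm = term.getKey();
            for (Map.Entry<Integer, Integer> posting : term.getValue().entrySet()) {
                stats.sumDocFreq++;
                stats.sumTotalTermFreq += posting.getValue();
                docsSeen.set(posting.getKey());
            }
        }
        stats.docCount = docsSeen.cardinality();
        return stats;
    }
}
```

By construction the results satisfy docCount <= sumDocFreq <= sumTotalTermFreq, since every counted document has at least one posting and every posting has at least one occurrence.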

Term Index

The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.

TermsIndex (.tip) --> Header, FSTIndex^NumFields, Footer
Header --> CodecHeader
FSTIndex --> FST<byte[]>
Footer --> CodecFooter

Notes:

The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
An on-disk block may end up with more terms than the allowed maximum (default: 48). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the output in the FST for the block's prefix encodes the leading byte of each sub-block and its file pointer.
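
The floor-block idea can be illustrated with a simplified splitter. The sketch below is an assumption-laden approximation rather than the writer's actual code: FloorBlockSketch is a hypothetical name, suffixes are plain ASCII strings instead of byte arrays, and the rule used is "cut only where the leading character changes, once the current block holds at least minItems entries and more than maxItems entries remain".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting one oversized block into floor blocks. */
final class FloorBlockSketch {

    /**
     * @param suffixes sorted term suffixes that share one block prefix
     * @return consecutive floor blocks; entries with the same leading
     *         character are never separated
     */
    static List<List<String>> split(List<String> suffixes, int minItems, int maxItems) {
        List<List<String>> blocks = new ArrayList<>();
        int end = suffixes.size();
        int blockStart = 0;
        int lastLead = -1; // -1 stands for "no leading character" (empty suffix)
        for (int i = 0; i < end; i++) {
            String suffix = suffixes.get(i);
            int lead = suffix.isEmpty() ? -1 : suffix.charAt(0);
            if (lead != lastLead) {
                int itemsInBlock = i - blockStart;
                // Cut only on a leading-character boundary, and only while
                // the remaining entries would still overflow a single block.
                if (itemsInBlock >= minItems && end - blockStart > maxItems) {
                    blocks.add(new ArrayList<>(suffixes.subList(blockStart, i)));
                    blockStart = i;
                }
                lastLead = lead;
            }
        }
        blocks.add(new ArrayList<>(suffixes.subList(blockStart, end)));
        return blocks;
    }
}
```

Because cuts happen only where the leading character changes, the first character of each floor block is enough to route a lookup to the right block, which is why the index entry only needs one leading byte and a file pointer per floor block. In this sketch the last block can be smaller than minItems.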

See Also:

Lucene90BlockTreeTermsReader

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary

- static final int DEFAULT_MAX_BLOCK_SIZE

  Suggested default value for the maxItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
- static final int DEFAULT_MIN_BLOCK_SIZE

  Suggested default value for the minItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).

Constructor Summary

- Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock)

  Create a new writer.

Method Summary

- void close()
- static void validateSettings(int minItemsInBlock, int maxItemsInBlock)

  Throws IllegalArgumentException if any of these settings is invalid.
- void write(Fields fields, NormsProducer norms)

  Write all fields, terms and postings.

Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_MIN_BLOCK_SIZE
  
  public static final int DEFAULT_MIN_BLOCK_SIZE
  
  Suggested default value for the minItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
  See Also:
  
  Constant Field Values
- DEFAULT_MAX_BLOCK_SIZE
  
  public static final int DEFAULT_MAX_BLOCK_SIZE
  
  Suggested default value for the maxItemsInBlock parameter to Lucene90BlockTreeTermsWriter(SegmentWriteState,PostingsWriterBase,int,int).
  See Also:
  
  Constant Field Values
Constructor Details
- Lucene90BlockTreeTermsWriter
  
  public Lucene90BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock) throws IOException
  
  Create a new writer. The number of items (terms or sub-blocks) per block will aim to be between minItemsInBlock and maxItemsInBlock, though in some cases the blocks may be smaller than the min.
  
  Throws:
  
  IOException
Method Details
- validateSettings
  
  public static void validateSettings(int minItemsInBlock, int maxItemsInBlock)
  
  Throws IllegalArgumentException if any of these settings is invalid.
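
This page does not list which settings count as invalid. The sketch below shows one plausible set of checks as a stand-alone class; BlockSizeSettingsSketch and its three rules (a minimum of at least 2, a maximum no smaller than the minimum, and a maximum of at least 2*(minItemsInBlock-1)) are assumptions for illustration, not a statement of what validateSettings actually enforces.

```java
/** Hypothetical stand-alone sketch; the rules below are assumptions. */
final class BlockSizeSettingsSketch {

    /** Throws IllegalArgumentException if the pair looks invalid under the assumed rules. */
    static void validate(int minItemsInBlock, int maxItemsInBlock) {
        if (minItemsInBlock <= 1) {
            throw new IllegalArgumentException(
                "minItemsInBlock must be >= 2; got " + minItemsInBlock);
        }
        if (minItemsInBlock > maxItemsInBlock) {
            throw new IllegalArgumentException(
                "maxItemsInBlock must be >= minItemsInBlock; got maxItemsInBlock="
                    + maxItemsInBlock + " minItemsInBlock=" + minItemsInBlock);
        }
        if (2 * (minItemsInBlock - 1) > maxItemsInBlock) {
            throw new IllegalArgumentException(
                "maxItemsInBlock must be at least 2*(minItemsInBlock-1); got maxItemsInBlock="
                    + maxItemsInBlock + " minItemsInBlock=" + minItemsInBlock);
        }
    }

    /** Convenience wrapper that reports validity instead of throwing. */
    static boolean isValid(int minItemsInBlock, int maxItemsInBlock) {
        try {
            validate(minItemsInBlock, maxItemsInBlock);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Under these assumed rules the default range mentioned in the class description, 25 to 48, is accepted, since 2*(25-1) equals 48.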
- write
  
  public void write(Fields fields, NormsProducer norms) throws IOException
  
  Description copied from class: FieldsConsumer
  Write all fields, terms and postings. This is the "pull" API, allowing you to iterate more than once over the postings, somewhat analogous to using a DOM API to traverse an XML tree.
  Notes:
  
  You must compute index statistics, including each term's docFreq and totalTermFreq, as well as the summary sumTotalTermFreq, sumDocFreq and docCount.
  You must skip terms that have no docs and fields that have no terms, even though the provided Fields API will expose them; this typically requires lazily writing the field or term until you've actually seen the first term or document.
  The provided Fields instance is limited: you cannot call any methods that return statistics/counts; you cannot pass a non-null live docs when pulling docs/positions enums.
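
The "lazy writing" requirement in the notes above can be sketched with plain collections standing in for the Fields API. LazyFieldWriteSketch and its nested-map input (field to term to docIDs) are hypothetical, and the returned list of strings merely stands in for what a real consumer would write to its output files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: emit a field only once its first term with docs is seen. */
final class LazyFieldWriteSketch {

    /** fields: field name -> (term -> docIDs containing the term). */
    static List<String> write(TreeMap<String, TreeMap<String, List<Integer>>> fields) {
        List<String> written = new ArrayList<>();
        for (Map.Entry<String, TreeMap<String, List<Integer>>> field : fields.entrySet()) {
            boolean fieldStarted = false; // nothing is written for the field yet
            for (Map.Entry<String, List<Integer>> term : field.getValue().entrySet()) {
                int docFreq = term.getValue().size();
                if (docFreq == 0) {
                    continue; // skip terms that have no docs
                }
                if (!fieldStarted) {
                    // First real term: only now is the field itself emitted.
                    written.add("field " + field.getKey());
                    fieldStarted = true;
                }
                written.add("  term " + term.getKey() + " docFreq=" + docFreq);
            }
            // A field whose terms were all empty never reaches the output.
        }
        return written;
    }
}
```

Deferring the field header until the first term with documents means that empty fields and empty terms leave no trace in the output, without needing a separate pre-scan.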
  Specified by:
  
  write in class FieldsConsumer
  
  Throws:
  
  IOException
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Specified by:
  
  close in class FieldsConsumer
  
  Throws:
  
  IOException
