org.apache.lucene.codecs.uniformsplit.UniformSplitTermsWriter

All Implemented Interfaces:: Closeable, AutoCloseable

Direct Known Subclasses:: STUniformSplitTermsWriter

public class UniformSplitTermsWriter extends FieldsConsumer

A block-based terms index and dictionary that assigns terms to nearly uniform length blocks. This technique is called Uniform Split.

The block construction is driven by two parameters, targetNumBlockLines and deltaNumLines. Each block size (number of terms) is targetNumBlockLines+-deltaNumLines. The algorithm computes the minimal distinguishing prefix (MDP) between each term and its previous term (alphabetically ordered). Then it selects in the neighborhood of the targetNumBlockLines, and within the deltaNumLines, the term with the minimal MDP. This term becomes the first term of the next block and its MDP is the block key. This block key is added to the terms dictionary trie.

We call dictionary the trie structure in memory, and block file the disk file containing the block lines, with one term and its corresponding term state details per line.

When seeking a term, the dictionary seeks the floor leaf of the trie for the searched term and jumps to the corresponding file pointer in the block file. There, the block terms are scanned for the exact searched term.

The terms inside a block do not need to share a prefix. Only the block key is used to find the block from the dictionary trie. And the block key is selected because it is the locally smallest MDP. This makes the dictionary trie very compact.

An interesting property of the Uniform Split technique is the very linear balance between memory usage and lookup performance. By decreasing the target block size, the block scan becomes faster, and since there are more blocks, the dictionary trie memory usage increases. Additionally, small blocks are faster to read from disk. A good sweet spot for the target block size is 32 with delta of 3 (10%) (default values). This can be tuned in the constructor.

There are additional optimizations:

Each block has a header that allows the lookup to jump directly to the middle term with a fast comparison. This reduces the linear scan by 2 for a small disk size increase.
Each block term is incrementally encoded according to its previous term. This both reduces the disk size and speeds up the block scan.
All term line details (the terms states) are written after all terms. This allows faster term scan without needing to decode the term states.
All file pointers are base-encoded. Their value is relative to the block base file pointer (not to the previous file pointer), this allows to read the term state of any term independently.

Blocks can be compressed or encrypted with an optional BlockEncoder provided in the constructor.

The block file contains all the term blocks for each field sequentially. It also contains the fields metadata at the end of the file.

The dictionary file contains the trie (FST bytes) for each field sequentially.

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary

Fields

Modifier and Type

Field

Description

protected final BlockEncoder

blockEncoder

protected final IndexOutput

blockOutput

static final int

DEFAULT_DELTA_NUM_LINES

Default value for the maximum allowed delta variation of the block size (delta of the number of terms per block).

static final int

DEFAULT_TARGET_NUM_BLOCK_LINES

Default value for the target block size (number of terms per block).

protected final int

deltaNumLines

protected final IndexOutput

dictionaryOutput

protected final FieldInfos

fieldInfos

protected final FieldMetadata.Serializer

fieldMetadataWriter

protected static final int

MAX_NUM_BLOCK_LINES

Upper limit of the block size (maximum number of terms per block).

protected final int

maxDoc

protected final PostingsWriterBase

postingsWriter

protected final int

targetNumBlockLines
Constructor Summary

Constructors

Modifier

Constructor

Description

UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder)

protected

UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension)

UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder)
Method Summary

Modifier and Type

Method

Description

void

close()

protected static void

validateSettings(int targetNumBlockLines, int deltaNumLines)

Validates the constructor settings.

void

write(Fields fields, NormsProducer normsProducer)

protected void

writeDictionary(IndexDictionary.Builder dictionaryBuilder)

Writes the dictionary index (FST) to disk.

protected void

writeEncodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput)

protected void

writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput)

protected int

writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer)

protected BlockTermState

writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer)

Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats.

protected void

writeUnencodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput)

Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_TARGET_NUM_BLOCK_LINES
  
  public static final int DEFAULT_TARGET_NUM_BLOCK_LINES
  
  Default value for the target block size (number of terms per block).
  See Also:
  
  Constant Field Values
- DEFAULT_DELTA_NUM_LINES
  
  public static final int DEFAULT_DELTA_NUM_LINES
  
  Default value for the maximum allowed delta variation of the block size (delta of the number of terms per block). The block size will be [target block size]+-[allowed delta].
  See Also:
  
  Constant Field Values
- MAX_NUM_BLOCK_LINES
  
  protected static final int MAX_NUM_BLOCK_LINES
  
  Upper limit of the block size (maximum number of terms per block).
  See Also:
  
  Constant Field Values
- fieldInfos
  
  protected final FieldInfos fieldInfos
- postingsWriter
  
  protected final PostingsWriterBase postingsWriter
- maxDoc
  
  protected final int maxDoc
- targetNumBlockLines
  
  protected final int targetNumBlockLines
- deltaNumLines
  
  protected final int deltaNumLines
- blockEncoder
  
  protected final BlockEncoder blockEncoder
- fieldMetadataWriter
  
  protected final FieldMetadata.Serializer fieldMetadataWriter
- blockOutput
  
  protected final IndexOutput blockOutput
- dictionaryOutput
  
  protected final IndexOutput dictionaryOutput
Constructor Details
- UniformSplitTermsWriter
  
  public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder) throws IOException
  
  Parameters:
  
  blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
  
  Throws:
  
  IOException
- UniformSplitTermsWriter
  
  public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder) throws IOException
  
  Parameters:
  
  blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
  
  Throws:
  
  IOException
- UniformSplitTermsWriter
  
  protected UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension) throws IOException
  
  Parameters:
  
  targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0. The parameters can be pre-validated with validateSettings(int, int). There is one term per block line, with its corresponding details (TermState).
  
  deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. The block size will be targetNumBlockLines+-deltaNumLines. The block size must always be less than or equal to MAX_NUM_BLOCK_LINES.
  
  blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
  
  Throws:
  
  IOException
Method Details
- validateSettings
  
  protected static void validateSettings(int targetNumBlockLines, int deltaNumLines)
  
  Validates the constructor settings.
  
  Parameters:
  
  targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0.
  
  deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. Additionally, targetNumBlockLines + deltaNumLines must be less than or equal to MAX_NUM_BLOCK_LINES.
- write
  
  public void write(Fields fields, NormsProducer normsProducer) throws IOException
  
  Specified by:
  
  write in class FieldsConsumer
  
  Throws:
  
  IOException
- writeFieldsMetadata
  
  protected void writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput) throws IOException
  
  Throws:
  
  IOException
- writeUnencodedFieldsMetadata
  
  protected void writeUnencodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) throws IOException
  
  Throws:
  
  IOException
- writeEncodedFieldsMetadata
  
  protected void writeEncodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) throws IOException
  
  Throws:
  
  IOException
- writeFieldTerms
  
  protected int writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer) throws IOException
  
  Returns:
  
  1 if the field was written; 0 otherwise.
  
  Throws:
  
  IOException
- writePostingLine
  
  protected BlockTermState writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer) throws IOException
  
  Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats.
  
  Returns:
  
  the written BlockTermState; or null if none.
  
  Throws:
  
  IOException
- writeDictionary
  
  protected void writeDictionary(IndexDictionary.Builder dictionaryBuilder) throws IOException
  
  Writes the dictionary index (FST) to disk.
  
  Throws:
  
  IOException
- close
  
  public void close() throws IOException
  
  Specified by:
  
  close in interface AutoCloseable
  
  Specified by:
  
  close in interface Closeable
  
  Specified by:
  
  close in class FieldsConsumer
  
  Throws:
  
  IOException

Class UniformSplitTermsWriter

Field Summary

Constructor Summary

Method Summary

Methods inherited from class org.apache.lucene.codecs.FieldsConsumer

Methods inherited from class java.lang.Object

Field Details

DEFAULT_TARGET_NUM_BLOCK_LINES

DEFAULT_DELTA_NUM_LINES

MAX_NUM_BLOCK_LINES

fieldInfos

postingsWriter

maxDoc

targetNumBlockLines

deltaNumLines

blockEncoder

fieldMetadataWriter

blockOutput

dictionaryOutput

Constructor Details

UniformSplitTermsWriter

UniformSplitTermsWriter

UniformSplitTermsWriter

Method Details

validateSettings

write

writeFieldsMetadata

writeUnencodedFieldsMetadata

writeEncodedFieldsMetadata

writeFieldTerms

writePostingLine

writeDictionary

close