Class UniformSplitTermsWriter
- All Implemented Interfaces: Closeable, AutoCloseable
- Direct Known Subclasses: STUniformSplitTermsWriter
The block construction is driven by two parameters, targetNumBlockLines and deltaNumLines. Each block size (number of terms) is targetNumBlockLines ± deltaNumLines. The algorithm computes the minimal distinguishing prefix (MDP) between each term and its previous term (in alphabetical order). Then, in the neighborhood of targetNumBlockLines and within deltaNumLines of it, it selects the term with the minimal MDP. This term becomes the first term of the next block and its MDP is the block key. This block key is added to the terms dictionary trie.
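The selection described above can be illustrated with a small, self-contained sketch. It is not the Lucene implementation: it works on Java strings instead of term bytes, the class and method names (UniformSplitSketch, mdpLength, blockStarts, blockKey) are hypothetical, and it assumes that ties go to the earliest candidate, that a split only happens while the whole candidate window is available, and that the first block is keyed by the first character of its first term.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and tie-breaking rules are assumptions.
final class UniformSplitSketch {

  /** Length of the minimal distinguishing prefix of term, relative to the term sorted just before it. */
  static int mdpLength(String previous, String term) {
    int limit = Math.min(previous.length(), term.length());
    int common = 0;
    while (common < limit && previous.charAt(common) == term.charAt(common)) {
      common++;
    }
    // One character past the shared prefix is enough to tell the two terms apart.
    return common + 1;
  }

  /** Returns the index of the first term of each block. Requires 0 <= delta < target. */
  static List<Integer> blockStarts(List<String> sortedTerms, int target, int delta) {
    List<Integer> starts = new ArrayList<>();
    if (sortedTerms.isEmpty()) {
      return starts;
    }
    int start = 0;
    starts.add(start);
    // Assumption: split only while the whole window [target - delta, target + delta] exists;
    // the remaining terms form the last block.
    while (start + target + delta < sortedTerms.size()) {
      int lo = start + target - delta;
      int hi = start + target + delta;
      int best = lo;
      int bestMdp = mdpLength(sortedTerms.get(lo - 1), sortedTerms.get(lo));
      for (int i = lo + 1; i <= hi; i++) {
        int mdp = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        if (mdp < bestMdp) { // strict comparison: the earliest minimal MDP wins
          best = i;
          bestMdp = mdp;
        }
      }
      starts.add(best);
      start = best;
    }
    return starts;
  }

  /** Block key of the block starting at blockStart: the MDP of its first term. */
  static String blockKey(List<String> sortedTerms, int blockStart) {
    String term = sortedTerms.get(blockStart);
    if (blockStart == 0) {
      // Assumption: the first block is keyed by the first character of its first term.
      return term.substring(0, Math.min(1, term.length()));
    }
    return term.substring(0, mdpLength(sortedTerms.get(blockStart - 1), term));
  }

  public static void main(String[] args) {
    List<String> terms = List.of("apple", "apricot", "banana", "bandana", "bandit", "cat",
        "catalog", "cater", "dog", "dot", "dove", "eagle");
    for (int start : blockStarts(terms, 4, 1)) {
      System.out.println(start + " -> key \"" + blockKey(terms, start) + "\"");
    }
  }
}
```

With the twelve terms in main, a target of 4 and a delta of 1, the splits land at indices 0, 5 and 8, with block keys "a", "c" and "d": the one-character MDPs at "cat" and "dog" are the smallest in their windows.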
We call dictionary the trie structure in memory, and block file the disk file containing the block lines, with one term and its corresponding term state details per line.
When seeking a term, the dictionary seeks the floor leaf of the trie for the searched term and jumps to the corresponding file pointer in the block file. There, the block terms are scanned for the exact searched term.
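The seek path can be sketched as follows. This is only an illustration under stated assumptions: a java.util.TreeMap stands in for the in-memory trie, a block index stands in for the file pointer, blocks are plain in-memory lists rather than encoded lines on disk, and the class name FloorSeekSketch is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; the real dictionary is a trie (FST) and blocks live on disk.
final class FloorSeekSketch {

  // Stand-in for the dictionary: block key -> block index (a "file pointer").
  private final TreeMap<String, Integer> dictionary = new TreeMap<>();
  private final List<List<String>> blocks;

  FloorSeekSketch(List<String> blockKeys, List<List<String>> blocks) {
    for (int i = 0; i < blockKeys.size(); i++) {
      dictionary.put(blockKeys.get(i), i);
    }
    this.blocks = blocks;
  }

  /** Returns true if the term is present. */
  boolean seekExact(String term) {
    // Floor lookup: the greatest block key that is less than or equal to the term.
    Map.Entry<String, Integer> floor = dictionary.floorEntry(term);
    if (floor == null) {
      return false; // the term sorts before the first block key
    }
    // Scan the block lines, which are sorted, for the exact term.
    for (String candidate : blocks.get(floor.getValue())) {
      int cmp = candidate.compareTo(term);
      if (cmp == 0) {
        return true;
      }
      if (cmp > 0) {
        break; // passed the position where the term would be
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FloorSeekSketch index = new FloorSeekSketch(
        List.of("a", "c", "d"),
        List.of(
            List.of("apple", "apricot", "banana", "bandana", "bandit"),
            List.of("cat", "catalog", "cater"),
            List.of("dog", "dot", "dove", "eagle")));
    System.out.println(index.seekExact("cater")); // true
    System.out.println(index.seekExact("cab"));   // false
  }
}
```

The floor lookup is sufficient because a block key is a prefix of the block's first term and sorts after the last term of the previous block, so every term of a block lies between its own key and the next block's key.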
The terms inside a block do not need to share a prefix. Only the block key is used to find the block from the dictionary trie. And the block key is selected because it is the locally smallest MDP. This makes the dictionary trie very compact.
An interesting property of the Uniform Split technique is the linear trade-off between memory usage and lookup performance. Decreasing the target block size makes the block scan faster, but since there are more blocks, the dictionary trie uses more memory. Additionally, small blocks are faster to read from disk. A good sweet spot is a target block size of 32 with a delta of 3 (about 10%), which are the default values. Both can be tuned in the constructor.
There are additional optimizations:
- Each block has a header that allows the lookup to jump directly to the middle term with a fast comparison. This halves the linear scan for a small increase in disk size.
- Each block term is incrementally encoded according to its previous term. This both reduces the disk size and speeds up the block scan.
- All term line details (the term states) are written after all terms. This allows a faster term scan without needing to decode the term states.
- All file pointers are base-encoded: their value is relative to the block base file pointer (not to the previous file pointer), which allows the term state of any term to be read independently.
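As an illustration of the incremental encoding mentioned in the second item, the sketch below stores each term of a block as the length of the prefix it shares with the previous term plus the remaining suffix. This is a generic prefix-compression example and not the on-disk format used by this writer; the names IncrementalEncodingSketch and EncodedTerm are hypothetical, and strings are used in place of term bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of incremental (shared-prefix) encoding; not the actual block format.
final class IncrementalEncodingSketch {

  /** One encoded line: how much of the previous term to reuse, then the new suffix. */
  static final class EncodedTerm {
    final int prefixLength;
    final String suffix;

    EncodedTerm(int prefixLength, String suffix) {
      this.prefixLength = prefixLength;
      this.suffix = suffix;
    }
  }

  static List<EncodedTerm> encode(List<String> blockTerms) {
    List<EncodedTerm> encoded = new ArrayList<>();
    String previous = ""; // the first term of a block shares nothing and is stored in full
    for (String term : blockTerms) {
      int limit = Math.min(previous.length(), term.length());
      int shared = 0;
      while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
        shared++;
      }
      encoded.add(new EncodedTerm(shared, term.substring(shared)));
      previous = term;
    }
    return encoded;
  }

  static List<String> decode(List<EncodedTerm> encoded) {
    List<String> terms = new ArrayList<>();
    String previous = "";
    for (EncodedTerm line : encoded) {
      // Rebuild the term from the reused prefix of the previous term and the stored suffix.
      String term = previous.substring(0, line.prefixLength) + line.suffix;
      terms.add(term);
      previous = term;
    }
    return terms;
  }

  public static void main(String[] args) {
    List<String> block = List.of("cat", "catalog", "cater");
    for (EncodedTerm line : encode(block)) {
      System.out.println(line.prefixLength + " + \"" + line.suffix + "\"");
    }
    System.out.println(decode(encode(block)));
  }
}
```

Because each line depends only on the previous term, a scan can compare the searched term against the stored suffixes while it advances, which is one plausible reason the encoding also speeds up the block scan.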
Blocks can be compressed or encrypted with an optional BlockEncoder provided in the constructor.
The block file contains all the term blocks for each field sequentially. It also contains the fields metadata at the end of the file.
The dictionary file contains the trie (FST bytes) for each field sequentially.
- WARNING: This API is experimental and might change in incompatible ways in the next release.
-
Field Summary
Modifier and Type | Field | Description
protected final BlockEncoder | blockEncoder |
protected final IndexOutput | blockOutput |
static final int | DEFAULT_DELTA_NUM_LINES | Default value for the maximum allowed delta variation of the block size (delta of the number of terms per block).
static final int | DEFAULT_TARGET_NUM_BLOCK_LINES | Default value for the target block size (number of terms per block).
protected final int | deltaNumLines |
protected final IndexOutput | dictionaryOutput |
protected final FieldInfos | fieldInfos |
protected final FieldMetadata.Serializer | fieldMetadataWriter |
protected static final int | MAX_NUM_BLOCK_LINES | Upper limit of the block size (maximum number of terms per block).
protected final int | maxDoc |
protected final PostingsWriterBase | postingsWriter |
protected final int | targetNumBlockLines |
-
Constructor Summary
Modifier | Constructor | Description
 | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder) |
protected | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension) |
 | UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder) |
-
Method Summary
Modifier and Type | Method | Description
void | close() |
protected static void | validateSettings(int targetNumBlockLines, int deltaNumLines) | Validates the constructor settings.
void | write(Fields fields, NormsProducer normsProducer) |
protected void | writeDictionary(IndexDictionary.Builder dictionaryBuilder) | Writes the dictionary index (FST) to disk.
protected void | writeEncodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) |
protected void | writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput) |
protected int | writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer) |
protected BlockTermState | writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer) | Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats.
protected void | writeUnencodedFieldsMetadata(ByteBuffersDataOutput fieldsOutput) |
Methods inherited from class org.apache.lucene.codecs.FieldsConsumer
merge
-
Field Details
-
DEFAULT_TARGET_NUM_BLOCK_LINES
public static final int DEFAULT_TARGET_NUM_BLOCK_LINES
Default value for the target block size (number of terms per block).
-
DEFAULT_DELTA_NUM_LINES
public static final int DEFAULT_DELTA_NUM_LINES
Default value for the maximum allowed delta variation of the block size (delta of the number of terms per block). The block size will be [target block size] ± [allowed delta].
-
MAX_NUM_BLOCK_LINES
protected static final int MAX_NUM_BLOCK_LINES
Upper limit of the block size (maximum number of terms per block).
-
fieldInfos
-
postingsWriter
-
maxDoc
protected final int maxDoc
-
targetNumBlockLines
protected final int targetNumBlockLines
-
deltaNumLines
protected final int deltaNumLines
-
blockEncoder
-
fieldMetadataWriter
-
blockOutput
-
dictionaryOutput
-
-
Constructor Details
-
UniformSplitTermsWriter
public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, BlockEncoder blockEncoder) throws IOException
- Parameters:
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
- Throws:
IOException
-
UniformSplitTermsWriter
public UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder) throws IOException
- Parameters:
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
- Throws:
IOException
-
UniformSplitTermsWriter
protected UniformSplitTermsWriter(PostingsWriterBase postingsWriter, SegmentWriteState state, int targetNumBlockLines, int deltaNumLines, BlockEncoder blockEncoder, FieldMetadata.Serializer fieldMetadataWriter, String codecName, int versionCurrent, String termsBlocksExtension, String dictionaryExtension) throws IOException
- Parameters:
targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0. The parameters can be pre-validated with validateSettings(int, int). There is one term per block line, with its corresponding details (TermState).
deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. The block size will be targetNumBlockLines ± deltaNumLines. The block size must always be less than or equal to MAX_NUM_BLOCK_LINES.
blockEncoder - Optional block encoder, may be null if none. It can be used for compression or encryption.
- Throws:
IOException
-
-
Method Details
-
validateSettings
protected static void validateSettings(int targetNumBlockLines, int deltaNumLines)
Validates the constructor settings.
- Parameters:
targetNumBlockLines - Target number of lines per block. Must be strictly greater than 0.
deltaNumLines - Maximum allowed delta variation of the number of lines per block. Must be greater than or equal to 0 and strictly less than targetNumBlockLines. Additionally, targetNumBlockLines + deltaNumLines must be less than or equal to MAX_NUM_BLOCK_LINES.
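The documented constraints can be written out as a short check. This is a sketch of the rules as stated above, not the actual method body: the value given to MAX_NUM_BLOCK_LINES is a placeholder, and throwing IllegalArgumentException is an assumption, since this page specifies neither.

```java
// Illustrative sketch of the documented constraints; not the actual implementation.
final class ValidateSettingsSketch {

  // Placeholder value for illustration only; the real constant's value is not stated on this page.
  static final int MAX_NUM_BLOCK_LINES = 1_000;

  static void validateSettings(int targetNumBlockLines, int deltaNumLines) {
    if (targetNumBlockLines <= 0) {
      throw new IllegalArgumentException(
          "targetNumBlockLines must be strictly greater than 0, got " + targetNumBlockLines);
    }
    if (deltaNumLines < 0 || deltaNumLines >= targetNumBlockLines) {
      throw new IllegalArgumentException(
          "deltaNumLines must be in [0, targetNumBlockLines), got " + deltaNumLines);
    }
    // The largest possible block has targetNumBlockLines + deltaNumLines lines.
    if (targetNumBlockLines + deltaNumLines > MAX_NUM_BLOCK_LINES) {
      throw new IllegalArgumentException(
          "targetNumBlockLines + deltaNumLines must not exceed " + MAX_NUM_BLOCK_LINES);
    }
  }

  public static void main(String[] args) {
    validateSettings(32, 3); // the documented default values pass
    try {
      validateSettings(32, 32); // delta must be strictly less than the target
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```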
-
write
- Specified by:
write in class FieldsConsumer
- Throws:
IOException
-
writeFieldsMetadata
protected void writeFieldsMetadata(int fieldsNumber, ByteBuffersDataOutput fieldsOutput) throws IOException
- Throws:
IOException
-
writeUnencodedFieldsMetadata
- Throws:
IOException
-
writeEncodedFieldsMetadata
- Throws:
IOException
-
writeFieldTerms
protected int writeFieldTerms(BlockWriter blockWriter, DataOutput fieldsOutput, TermsEnum termsEnum, FieldInfo fieldInfo, NormsProducer normsProducer) throws IOException
- Returns:
1 if the field was written; 0 otherwise.
- Throws:
IOException
-
writePostingLine
protected BlockTermState writePostingLine(TermsEnum termsEnum, FieldMetadata fieldMetadata, NormsProducer normsProducer) throws IOException
Writes the posting values for the current term in the given TermsEnum and updates the FieldMetadata stats.
- Returns:
the written BlockTermState; or null if none.
- Throws:
IOException
-
writeDictionary
Writes the dictionary index (FST) to disk.
- Throws:
IOException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Specified by:
close in class FieldsConsumer
- Throws:
IOException
-