org.apache.lucene.codecs.memory
Class FSTTermsWriter
java.lang.Object
org.apache.lucene.codecs.FieldsConsumer
org.apache.lucene.codecs.memory.FSTTermsWriter
- All Implemented Interfaces:
- Closeable
public class FSTTermsWriter
- extends FieldsConsumer
FST-based term dict, using metadata as FST output.
The FST directly holds the mapping between <term, metadata>.
Term metadata consists of three parts:
1. term statistics: docFreq, totalTermFreq;
2. monotonic long[], e.g. the pointer to the postings list for that term;
3. generic byte[], e.g. other information needed by the postings reader.
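The three parts listed above can be pictured as a small value class. The following is only an illustrative sketch: the class name `TermMetadataSketch`, its field names, and the helper `isMonotonicAfter` are hypothetical and are not part of the Lucene API.

```java
/** Hypothetical holder for the three parts of term metadata; not a Lucene class. */
final class TermMetadataSketch {
    final int docFreq;         // term statistics: number of documents containing the term
    final long totalTermFreq;  // term statistics: total occurrences of the term
    final long[] longs;        // monotonic part, e.g. pointers into the postings list
    final byte[] bytes;        // generic part, opaque to the term dictionary

    TermMetadataSketch(int docFreq, long totalTermFreq, long[] longs, byte[] bytes) {
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
        this.longs = longs.clone();
        this.bytes = bytes.clone();
    }

    /** Returns true when every long is >= the corresponding long of the previous term. */
    boolean isMonotonicAfter(TermMetadataSketch previous) {
        if (previous.longs.length != longs.length) {
            return false;
        }
        for (int i = 0; i < longs.length; i++) {
            if (longs[i] < previous.longs[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The check in `isMonotonicAfter` expresses the property that makes the long[] part suitable for sharing between FST arcs: values never decrease from one term to the next.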
File:
Term Dictionary
The .tst contains a list of FSTs, one for each field.
The FST maps a term to its corresponding statistics (e.g. docFreq)
and metadata (e.g. information for the postings list reader, such as the file pointer
to the postings list).
Typically the metadata is separated into two parts:
- Monotonic long array: Some metadata is always ascending in the order
of the corresponding terms. The FST uses this part to share outputs between arcs.
- Generic byte array: Used to store non-monotonic metadata.
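To see why ascending long arrays lend themselves to output sharing, consider a simplified model in which the shared part of two outputs is their element-wise minimum and each arc keeps only the remainder. This is a sketch under that assumption: the class `LongArrayOutputsSketch` and its methods are hypothetical, and the actual output algebra used by the writer may differ in detail.

```java
/** Simplified, hypothetical model of sharing long[] outputs; not Lucene's implementation. */
final class LongArrayOutputsSketch {

    /** Shared part of two outputs: the element-wise minimum (arrays must have equal length). */
    static long[] common(long[] a, long[] b) {
        long[] shared = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            shared[i] = Math.min(a[i], b[i]);
        }
        return shared;
    }

    /** Remainder left on an arc once the shared part has been pushed toward the root. */
    static long[] subtract(long[] output, long[] shared) {
        long[] rest = new long[output.length];
        for (int i = 0; i < output.length; i++) {
            rest[i] = output[i] - shared[i];
        }
        return rest;
    }

    /** Reassembles the full output while walking from the root toward the final arc. */
    static long[] add(long[] shared, long[] rest) {
        long[] full = new long[shared.length];
        for (int i = 0; i < shared.length; i++) {
            full[i] = shared[i] + rest[i];
        }
        return full;
    }
}
```

For example, outputs {10, 200} and {15, 260} share {10, 200}; the second term then only needs the remainder {5, 60}. Because the values ascend with the terms, remainders stay non-negative and small.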
File format (X^N denotes N repetitions of X):
- TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
- FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?,
SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
- TermFST -->
FST<TermData>
- TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?,
<DocFreq[Same?], (TotalTermFreq-DocFreq)>?
- Header -->
CodecHeader
- DirOffset -->
Uint64
- DocFreq, LongsSize, BytesSize, NumFields,
FieldNumber, DocCount -->
VInt
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta -->
VLong
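VInt and VLong are variable-length encodings: each byte carries seven bits of the value, low-order bits first, and the high bit of a byte signals that another byte follows. The sketch below illustrates that scheme for non-negative values. The class `VLongSketch` is hypothetical and stands in for the library's own data input and output classes.

```java
/** Hypothetical illustration of the VInt/VLong byte layout; not a Lucene class. */
final class VLongSketch {

    /** Encodes a non-negative value using seven payload bits per byte, low-order bits first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative values");
        }
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L)); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode. */
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }
}
```

Small values such as a typical DocFreq fit in a single byte, which is why fields that are usually small are stored this way, while DirOffset is a fixed-width Uint64.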
Notes:
- The formats of PostingsHeader and the generic meta bytes are defined by the specific postings implementation:
they contain arbitrary per-file data (such as parameters or versioning information) and per-term data
(non-monotonic data such as pulsed postings data).
- The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs,
while deeper arcs carry only generic bytes and term statistics.
- The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part
is omitted when it is an array of 0s.
- Since LongsSize is fixed per field, it is written only once, in the field summary.
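The role of the Flag byte can be sketched as a small bit set that records which optional parts follow. The bit positions and names below are illustrative assumptions chosen for this example; the format description above does not specify them, and `TermDataFlagSketch` is a hypothetical class.

```java
/** Hypothetical flag computation for one arc's TermData; bit layout is an assumption. */
final class TermDataFlagSketch {
    static final int HAS_LONGS = 1;      // monotonic long[] part is present
    static final int HAS_BYTES = 1 << 1; // generic byte[] part is present
    static final int HAS_STATS = 1 << 2; // term statistics are present

    /** Builds the flag, omitting the monotonic part when it is an array of 0s. */
    static int flagFor(long[] longs, byte[] bytes, int docFreq) {
        int flag = 0;
        if (!allZero(longs)) {
            flag |= HAS_LONGS;
        }
        if (bytes != null && bytes.length > 0) {
            flag |= HAS_BYTES;
        }
        if (docFreq > 0) {
            flag |= HAS_STATS;
        }
        return flag;
    }

    /** Returns true when every element is 0, in which case nothing needs to be written. */
    static boolean allZero(long[] longs) {
        for (long value : longs) {
            if (value != 0) {
                return false;
            }
        }
        return true;
    }
}
```

A reader would test the same bits to decide which of the optional TermData fields to consume, so arcs deep in the FST that carry no monotonic data cost no extra bytes for it.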
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TERMS_VERSION_START
public static final int TERMS_VERSION_START
- See Also:
- Constant Field Values
TERMS_VERSION_CURRENT
public static final int TERMS_VERSION_CURRENT
- See Also:
- Constant Field Values
FSTTermsWriter
public FSTTermsWriter(SegmentWriteState state,
PostingsWriterBase postingsWriter)
throws IOException
- Throws:
IOException
addField
public TermsConsumer addField(FieldInfo field)
throws IOException
- Specified by:
addField
in class FieldsConsumer
- Throws:
IOException
close
public void close()
throws IOException
- Specified by:
close
in interface Closeable
- Specified by:
close
in class FieldsConsumer
- Throws:
IOException
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.