Class FSTCompiler<T>
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the subclasses of Outputs.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible; however, they cannot be packed.
FSTCompiler now supports three different workflows:
- Build the FST, use it immediately and entirely in RAM, then discard it.
- Build the FST, use it immediately and entirely in RAM, and also save it to another DataOutput so that it can be loaded and used later.
- Build the FST but stream it immediately to disk (except the FSTMetadata, which is saved at the end). To use it, you need to construct the corresponding DataInput and use the FST constructor to read it.
WARNING: This API is experimental and might change in incompatible ways in the next release.
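Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | add(IntsRef input, T output) | Add the next input/output pair. |
| FST.FSTMetadata<T> | compile() | Returns the metadata of the final FST. |
| long | fstRamBytesUsed() | |
| long | fstSizeInBytes() | |
| long | getArcCount() | |
| float | getDirectAddressingMaxOversizingFactor() | |
| FSTReader | getFSTReader() | Get the respective FSTReader of the DataOutput. |
| long | getNodeCount() | |
| static DataOutput | getOnHeapReaderWriter(int blockBits) | Get an on-heap DataOutput that allows the FST to be read immediately after writing, and also optionally saved to an external DataOutput. |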
Method Details
getOnHeapReaderWriter
public static DataOutput getOnHeapReaderWriter(int blockBits)
Get an on-heap DataOutput that allows the FST to be read immediately after writing, and also optionally saved to an external DataOutput.
- Parameters:
blockBits - how many bits wide to make each block of the DataOutput
- Returns:
the DataOutput
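As a rough illustration of what blockBits controls, the sketch below assumes each block holds 1 << blockBits bytes and shows how a byte position would map to a block index and an offset inside that block. BlockAddressing and its methods are hypothetical names for this JDK-only example, not Lucene classes, and Lucene's actual paging may differ.

```java
// Hypothetical illustration only; BlockAddressing is not part of Lucene.
final class BlockAddressing {
    private final int blockBits;
    private final int blockSize;
    private final int blockMask;

    BlockAddressing(int blockBits) {
        if (blockBits < 1 || blockBits > 30) {
            throw new IllegalArgumentException("blockBits must be in [1, 30], got " + blockBits);
        }
        this.blockBits = blockBits;
        this.blockSize = 1 << blockBits; // assumption: block size is a power of two
        this.blockMask = blockSize - 1;
    }

    /** Number of bytes held by one block. */
    int blockSize() {
        return blockSize;
    }

    /** Index of the block that contains the given absolute byte position. */
    int blockIndex(long position) {
        return (int) (position >>> blockBits);
    }

    /** Offset of the given absolute byte position inside its block. */
    int offsetInBlock(long position) {
        return (int) (position & blockMask);
    }

    public static void main(String[] args) {
        BlockAddressing addressing = new BlockAddressing(15);
        long position = 100_000L;
        System.out.println("block size = " + addressing.blockSize());
        System.out.println("block index = " + addressing.blockIndex(position));
        System.out.println("offset = " + addressing.offsetInBlock(position));
    }
}
```

Under this assumption, a larger blockBits means fewer, larger blocks, while a smaller value means more blocks and finer-grained allocation.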
getFSTReader
public FSTReader getFSTReader()
Get the respective FSTReader of the DataOutput. To call this method, you need to use the default DataOutput or getOnHeapReaderWriter(int), otherwise an exception is thrown.
- Returns:
the DataOutput as FSTReader
- Throws:
IllegalStateException - if the DataOutput does not implement FSTReader
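The check described above amounts to a capability test on the output object. The following JDK-only sketch shows that pattern under stated assumptions: ReadableSink, HeapSink, and SinkHolder are hypothetical stand-ins (for FSTReader, an on-heap output, and the compiler respectively), not Lucene types.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Hypothetical stand-in for a "readable" capability such as FSTReader.
interface ReadableSink {
    byte[] snapshot();
}

// Hypothetical on-heap output that can also be read back after writing.
final class HeapSink extends ByteArrayOutputStream implements ReadableSink {
    @Override
    public byte[] snapshot() {
        return toByteArray();
    }
}

// Hypothetical holder that writes to any OutputStream but can only
// expose a reader when the stream supports reading back.
final class SinkHolder {
    private final OutputStream out;

    SinkHolder(OutputStream out) {
        this.out = out;
    }

    ReadableSink reader() {
        if (out instanceof ReadableSink) {
            return (ReadableSink) out;
        }
        throw new IllegalStateException(
            "The output does not support reading back: " + out.getClass().getName());
    }

    public static void main(String[] args) {
        HeapSink heap = new HeapSink();
        heap.write(42);
        SinkHolder readable = new SinkHolder(heap);
        System.out.println("bytes readable: " + readable.reader().snapshot().length);

        SinkHolder writeOnly = new SinkHolder(new ByteArrayOutputStream());
        try {
            writeOnly.reader();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```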
getDirectAddressingMaxOversizingFactor
public float getDirectAddressingMaxOversizingFactor()
getNodeCount
public long getNodeCount()
getArcCount
public long getArcCount()
add
public void add(IntsRef input, T output) throws IOException
Add the next input/output pair. The provided input must be sorted after the previous one according to IntsRef.compareTo(org.apache.lucene.util.IntsRef). It's also OK to add the same input twice in a row with different outputs, as long as Outputs implements the Outputs.merge(T, T) method. Note that input is fully consumed once this method returns (so the caller is free to reuse it), but output is not. So if your outputs are mutable (e.g. ByteSequenceOutputs or IntSequenceOutputs), you cannot reuse them across calls.
- Throws:
IOException
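The contract above has three parts: inputs arrive in sorted order, the input buffer may be reused after the call, and the output object may not. The JDK-only sketch below models those rules so they can be observed directly. SortedPairSink is a hypothetical class, not Lucene code; it assumes that lexicographic comparison via Arrays.compare approximates the ordering of IntsRef.compareTo, and it does not model Outputs.merge for repeated inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the add() contract; not part of Lucene.
final class SortedPairSink {
    private final List<int[]> inputs = new ArrayList<>();
    private final List<byte[]> outputs = new ArrayList<>();
    private int[] last; // private copy of the most recently added input

    void add(int[] input, byte[] output) {
        // Equal inputs are allowed; only a strictly smaller input is rejected.
        if (last != null && Arrays.compare(last, input) > 0) {
            throw new IllegalArgumentException("inputs must be added in sorted order");
        }
        last = input.clone(); // input is copied, so the caller may reuse its array
        inputs.add(last);
        outputs.add(output);  // output is kept by reference, so the caller must not mutate it
    }

    int size() {
        return inputs.size();
    }

    int[] inputAt(int index) {
        return inputs.get(index);
    }

    byte[] outputAt(int index) {
        return outputs.get(index);
    }

    public static void main(String[] args) {
        SortedPairSink sink = new SortedPairSink();
        int[] scratch = {1, 2};
        sink.add(scratch, new byte[] {7});
        scratch[1] = 3; // safe: the sink holds its own copy of the first input
        sink.add(scratch, new byte[] {8});
        System.out.println("first input: " + Arrays.toString(sink.inputAt(0)));
        System.out.println("second input: " + Arrays.toString(sink.inputAt(1)));
    }
}
```

In this model, mutating a byte array after passing it as an output would silently change the stored value, which is the hazard the note about mutable outputs warns about.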
compile
public FST.FSTMetadata<T> compile() throws IOException
Returns the metadata of the final FST. NOTE: this returns null if the FST accepts nothing.
To create the FST, you need to:
- If an FSTReader DataOutput was used, such as the one returned by getOnHeapReaderWriter(int):

    fstMetadata = fstCompiler.compile();
    fst = FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader());

- If a non-FSTReader DataOutput was used, such as IndexOutput, you need to first create the corresponding DataInput, such as IndexInput, then pass it to the FST constructor:

    fstMetadata = fstCompiler.compile();
    fst = new FST<>(fstMetadata, dataInput, new OffHeapFSTStore());

- Throws:
IOException
fstRamBytesUsed
public long fstRamBytesUsed()
fstSizeInBytes
public long fstSizeInBytes()