org.apache.lucene.util.fst
public class Builder<T> extends Object
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the
subclasses of Outputs
.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
Modifier and Type | Class and Description |
---|---|
static class |
Builder.Arc<T>
Expert: holds a pending (seen but not yet serialized) arc.
|
static class |
Builder.FreezeTail<T>
Expert: this is invoked by Builder whenever a suffix
is serialized.
|
static class |
Builder.UnCompiledNode<T>
Expert: holds a pending (seen but not yet serialized) Node.
|
Constructor and Description |
---|
Builder(FST.INPUT_TYPE inputType,
int minSuffixCount1,
int minSuffixCount2,
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
Outputs<T> outputs,
Builder.FreezeTail<T> freezeTail,
boolean doPackFST,
float acceptableOverheadRatio,
boolean allowArrayArcs,
int bytesPageBits)
Instantiates an FST/FSA builder with all the possible tuning and construction
tweaks.
|
Builder(FST.INPUT_TYPE inputType,
Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning.
|
Modifier and Type | Method and Description |
---|---|
void |
add(IntsRef input,
T output)
It's OK to add the same input twice in a row with
different outputs, as long as outputs impls the merge
method.
|
FST<T> |
finish()
Returns final FST.
|
long |
fstSizeInBytes() |
long |
getMappedStateCount() |
long |
getTermCount() |
long |
getTotStateCount() |
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Builder(FST.INPUT_TYPE, int, int, boolean,
boolean, int, Outputs, FreezeTail, boolean, float,
boolean, int)
with pruning options turned off.public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits)
inputType
- The input type (transition labels). Can be anything from FST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are
represented as FST.INPUT_TYPE.BYTE4
(full unicode codepoints).minSuffixCount1
- If pruning the input graph during construction, this threshold is used for telling
if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node
is kept.minSuffixCount2
- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix
- If true
, the shared suffixes will be compacted into unique paths.
This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter to
false
creates a single suffix path for all input sequences. This will result in a larger
FST, but requires substantially less memory and CPU during building.doShareNonSingletonNodes
- Only used if doShareSuffix is true. Set this to
true to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.shareMaxTailLength
- Only used if doShareSuffix is true. Set this to
Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.outputs
- The output type for each input sequence. Applies only if building an FST. For
FSA, use NoOutputs.getSingleton()
and NoOutputs.getNoOutput()
as the
singleton output object.doPackFST
- Pass true to create a packed FST.acceptableOverheadRatio
- How to trade speed for space when building the FST. This option
is only relevant when doPackFST is true. @see PackedInts#getMutable(int, int, float)allowArrayArcs
- Pass false to disable the array arc optimization
while building the FST; this will make the resulting
FST smaller but slower to traverse.bytesPageBits
- How many bits wide to make each
byte[] block in the BytesStore; if you know the FST
will be large then make this larger. For example 15
bits = 32768 byte pages.public long getTotStateCount()
public long getTermCount()
public long getMappedStateCount()
public void add(IntsRef input, T output) throws IOException
ByteSequenceOutputs
or IntSequenceOutputs
) then you cannot reuse across
calls.IOException
public FST<T> finish() throws IOException
IOException
public long fstSizeInBytes()
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.