public class Builder<T> extends Object
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the
subclasses of Outputs.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
| Modifier and Type | Class and Description |
|---|---|
static class |
Builder.Arc<T>
Expert: holds a pending (seen but not yet serialized) arc.
|
static class |
Builder.FreezeTail<T>
Expert: this is invoked by Builder whenever a suffix
is serialized.
|
static class |
Builder.UnCompiledNode<T>
Expert: holds a pending (seen but not yet serialized) Node.
|
| Constructor and Description |
|---|
Builder(FST.INPUT_TYPE inputType,
int minSuffixCount1,
int minSuffixCount2,
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
Outputs<T> outputs,
Builder.FreezeTail<T> freezeTail,
boolean doPackFST,
float acceptableOverheadRatio,
boolean allowArrayArcs,
int bytesPageBits)
Instantiates an FST/FSA builder with all the possible tuning and construction
tweaks.
|
Builder(FST.INPUT_TYPE inputType,
Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning.
|
| Modifier and Type | Method and Description |
|---|---|
void |
add(IntsRef input,
T output)
It's OK to add the same input twice in a row with
different outputs, as long as outputs impls the merge
method.
|
FST<T> |
finish()
Returns final FST.
|
long |
fstSizeInBytes() |
long |
getMappedStateCount() |
long |
getTermCount() |
long |
getTotStateCount() |
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Builder(FST.INPUT_TYPE, int, int, boolean,
boolean, int, Outputs, FreezeTail, boolean, float,
boolean, int) with pruning options turned off.public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits)
inputType - The input type (transition labels). Can be anything from FST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are
represented as FST.INPUT_TYPE.BYTE4 (full unicode codepoints).minSuffixCount1 - If pruning the input graph during construction, this threshold is used for telling
if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node
is kept.minSuffixCount2 - (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix - If true, the shared suffixes will be compacted into unique paths.
This requires an additional hash map for lookups in memory. Setting this parameter to
false creates a single path for all input sequences. This will result in a larger
graph, but may require less memory and will speed up construction.doShareNonSingletonNodes - Only used if doShareSuffix is true. Set this to
true to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.shareMaxTailLength - Only used if doShareSuffix is true. Set this to
Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.outputs - The output type for each input sequence. Applies only if building an FST. For
FSA, use NoOutputs.getSingleton() and NoOutputs.getNoOutput() as the
singleton output object.doPackFST - Pass true to create a packed FST.acceptableOverheadRatio - How to trade speed for space when building the FST. This option
is only relevant when doPackFST is true. @see PackedInts#getMutable(int, int, float)allowArrayArcs - Pass false to disable the array arc optimization
while building the FST; this will make the resulting
FST smaller but slower to traverse.bytesPageBits - How many bits wide to make each
byte[] block in the BytesStore; if you know the FST
will be large then make this larger. For example 15
bits = 32768 byte pages.public long getTotStateCount()
public long getTermCount()
public long getMappedStateCount()
public void add(IntsRef input, T output) throws IOException
ByteSequenceOutputs or IntSequenceOutputs) then you cannot reuse across
calls.IOExceptionpublic FST<T> finish() throws IOException
IOExceptionpublic long fstSizeInBytes()
Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.