|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object org.apache.lucene.util.fst.Builder<T>
public class Builder<T>
Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the
subclasses of Outputs
.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
Nested Class Summary | |
---|---|
static class |
Builder.Arc<T>
Expert: holds a pending (seen but not yet serialized) arc. |
static class |
Builder.FreezeTail<T>
Expert: this is invoked by Builder whenever a suffix is serialized. |
static class |
Builder.UnCompiledNode<T>
Expert: holds a pending (seen but not yet serialized) Node. |
Constructor Summary | |
---|---|
Builder(FST.INPUT_TYPE inputType,
int minSuffixCount1,
int minSuffixCount2,
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
Outputs<T> outputs,
Builder.FreezeTail<T> freezeTail,
boolean doPackFST,
float acceptableOverheadRatio,
boolean allowArrayArcs,
int bytesPageBits)
Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. |
|
Builder(FST.INPUT_TYPE inputType,
Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning. |
Method Summary | |
---|---|
void |
add(IntsRef input,
T output)
It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method. |
FST<T> |
finish()
Returns final FST. |
long |
fstSizeInBytes()
|
long |
getMappedStateCount()
|
long |
getTermCount()
|
long |
getTotStateCount()
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Builder(FST.INPUT_TYPE, int, int, boolean,
boolean, int, Outputs, FreezeTail, boolean, float,
boolean, int)
with pruning options turned off.
public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, boolean doPackFST, float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits)
inputType
- The input type (transition labels). Can be anything from FST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are
represented as FST.INPUT_TYPE.BYTE4
(full unicode codepoints).minSuffixCount1
- If pruning the input graph during construction, this threshold is used for telling
if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node
is kept.minSuffixCount2
- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix
- If true
, the shared suffixes will be compacted into unique paths.
This requires an additional hash map for lookups in memory. Setting this parameter to
false
creates a single path for all input sequences. This will result in a larger
graph, but may require less memory and will speed up construction.doShareNonSingletonNodes
- Only used if doShareSuffix is true. Set this to
true to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.shareMaxTailLength
- Only used if doShareSuffix is true. Set this to
Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.outputs
- The output type for each input sequence. Applies only if building an FST. For
FSA, use NoOutputs.getSingleton()
and NoOutputs.getNoOutput()
as the
singleton output object.doPackFST
- Pass true to create a packed FST.acceptableOverheadRatio
- How to trade speed for space when building the FST. This option
is only relevant when doPackFST is true. @see PackedInts#getMutable(int, int, float)allowArrayArcs
- Pass false to disable the array arc optimization
while building the FST; this will make the resulting
FST smaller but slower to traverse.bytesPageBits
- How many bits wide to make each
byte[] block in the BytesStore; if you know the FST
will be large then make this larger. For example 15
bits = 32768 byte pages.Method Detail |
---|
public long getTotStateCount()
public long getTermCount()
public long getMappedStateCount()
public void add(IntsRef input, T output) throws IOException
ByteSequenceOutputs
or IntSequenceOutputs
) then you cannot reuse across
calls.
IOException
public FST<T> finish() throws IOException
IOException
public long fstSizeInBytes()
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |