|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object org.apache.lucene.util.fst.Builder<T>
public class Builder<T>
Builds a compact FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs (the FST becomes an FSA if you use NoOutputs). The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
If your outputs are ByteSequenceOutput then the final FST will be minimal, but if you use PositiveIntOutput then it's only "near minimal". For example, aa/0, aab/1, bbb/2 will produce 6 states when a 5 state fst is also possible. The parameterized type T is the output type. See the subclasses ofOutputs
.
Constructor Summary | |
---|---|
Builder(FST.INPUT_TYPE inputType,
int minSuffixCount1,
int minSuffixCount2,
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
Outputs<T> outputs)
Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. |
|
Builder(FST.INPUT_TYPE inputType,
Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning. |
Method Summary | |
---|---|
void |
add(BytesRef input,
T output)
|
void |
add(char[] s,
int offset,
int length,
T output)
Sugar: adds the UTF32 codepoints from char[] slice. |
void |
add(CharSequence s,
T output)
Sugar: adds the UTF32 codepoints from CharSequence. |
void |
add(IntsRef input,
T output)
It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method. |
FST<T> |
finish()
Returns final FST. |
int |
getMappedStateCount()
|
long |
getTermCount()
|
int |
getTotStateCount()
|
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Builder(FST.INPUT_TYPE, int, int, boolean, boolean, int, Outputs)
with
pruning options turned off.
public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs)
inputType
- The input type (transition labels). Can be anything from FST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are
represented as FST.INPUT_TYPE.BYTE4
(full unicode codepoints).minSuffixCount1
- If pruning the input graph during construction, this threshold is used for telling
if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node
is kept.minSuffixCount2
- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix
- If true
, the shared suffixes will be compacted into unique paths.
This requires an additional hash map for lookups in memory. Setting this parameter to
false
creates a single path for all input sequences. This will result in a larger
graph, but may require less memory and will speed up construction.doShareNonSingletonNodes
- Only used if doShareSuffix is true. Set this to
true to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.shareMaxTailLength
- Only used if doShareSuffix is true. Set this to
Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.outputs
- The output type for each input sequence. Applies only if building an FST. For
FSA, use NoOutputs.getSingleton()
and NoOutputs.getNoOutput()
as the
singleton output object.Method Detail |
---|
public int getTotStateCount()
public long getTermCount()
public int getMappedStateCount()
public void add(BytesRef input, T output) throws IOException
IOException
public void add(char[] s, int offset, int length, T output) throws IOException
IOException
public void add(CharSequence s, T output) throws IOException
IOException
public void add(IntsRef input, T output) throws IOException
IOException
public FST<T> finish() throws IOException
IOException
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |