Builder (Lucene 3.5.0 API)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.lucene.util.fst
Class Builder<T>

java.lang.Object
  org.apache.lucene.util.fst.Builder<T>

public class Builder<T>
extends Object
extends Object

Builds a compact FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs (the FST becomes an FSA if you use NoOutputs). The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).

NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

If your outputs are ByteSequenceOutput then the final FST will be minimal, but if you use PositiveIntOutput then it's only "near minimal". For example, aa/0, aab/1, bbb/2 will produce 6 states when a 5 state fst is also possible. The parameterized type T is the output type. See the subclasses of Outputs.

WARNING: This API is experimental and might change in incompatible ways in the next release.

Constructor Summary
`Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs)` Instantiates an FST/FSA builder with all the possible tuning and construction tweaks.
`Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)` Instantiates an FST/FSA builder without any pruning.

Method Summary
`void`	`add(BytesRef input, T output)`
`void`	`add(char[] s, int offset, int length, T output)` Sugar: adds the UTF32 codepoints from char[] slice.
`void`	`add(CharSequence s, T output)` Sugar: adds the UTF32 codepoints from CharSequence.
`void`	`add(IntsRef input, T output)` It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method.
`FST<T>`	`finish()` Returns final FST.
`int`	`getMappedStateCount()`
`long`	`getTermCount()`
`int`	`getTotStateCount()`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

Builder

public Builder(FST.INPUT_TYPE inputType,
               Outputs<T> outputs)

Instantiates an FST/FSA builder without any pruning. A shortcut to Builder(FST.INPUT_TYPE, int, int, boolean, boolean, int, Outputs) with pruning options turned off.

Builder

public Builder(FST.INPUT_TYPE inputType,
               int minSuffixCount1,
               int minSuffixCount2,
               boolean doShareSuffix,
               boolean doShareNonSingletonNodes,
               int shareMaxTailLength,
               Outputs<T> outputs)

Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. Read parameter documentation carefully.

Parameters:: inputType - The input type (transition labels). Can be anything from FST.INPUT_TYPE enumeration. Shorter types will consume less memory. Strings (character sequences) are represented as FST.INPUT_TYPE.BYTE4 (full unicode codepoints).; minSuffixCount1 - If pruning the input graph during construction, this threshold is used for telling if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.; minSuffixCount2 - (Note: only Mike McCandless knows what this one is really doing...); doShareSuffix - If true, the shared suffixes will be compacted into unique paths. This requires an additional hash map for lookups in memory. Setting this parameter to false creates a single path for all input sequences. This will result in a larger graph, but may require less memory and will speed up construction.; doShareNonSingletonNodes - Only used if doShareSuffix is true. Set this to true to ensure FST is fully minimal, at cost of more CPU and more RAM during building.; shareMaxTailLength - Only used if doShareSuffix is true. Set this to Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more CPU and more RAM during building.; outputs - The output type for each input sequence. Applies only if building an FST. For FSA, use NoOutputs.getSingleton() and NoOutputs.getNoOutput() as the singleton output object.

Method Detail

getTotStateCount

public int getTotStateCount()

getTermCount

public long getTermCount()

getMappedStateCount

public int getMappedStateCount()

add

public void add(BytesRef input,
                T output)
         throws IOException

Throws:: IOException

add

public void add(char[] s,
                int offset,
                int length,
                T output)
         throws IOException

Sugar: adds the UTF32 codepoints from char[] slice. FST must be FST.INPUT_TYPE.BYTE4!

Throws:: IOException

add

public void add(CharSequence s,
                T output)
         throws IOException

Sugar: adds the UTF32 codepoints from CharSequence. FST must be FST.INPUT_TYPE.BYTE4!

Throws:: IOException

add

public void add(IntsRef input,
                T output)
         throws IOException

It's OK to add the same input twice in a row with different outputs, as long as outputs impls the merge method.

Throws:: IOException

finish

public FST<T> finish()
              throws IOException

Returns final FST. NOTE: this will return null if nothing is accepted by the FST.

Throws:: IOException

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.lucene.util.fst Class Builder<T>

Builder

Builder

getTotStateCount

getTermCount

getMappedStateCount

add

add

add

add

finish

org.apache.lucene.util.fst
Class Builder<T>