Package org.apache.lucene.util.fst
Finite state transducers
This package implements Finite State Transducers with the following characteristics:
- Fast and low memory overhead construction of the minimal FST (but inputs must be provided in sorted order)
- Low object overhead and quick deserialization (byte[] representation)
- Pluggable
Outputs
representation N-shortest-paths
search by weight- Enumerators (
IntsRef
andBytesRef
) that behave likeSortedMap
iterators
FST Construction example:
// Input values (keys). These must be provided to Builder in Unicode code point (UTF8 or UTF32) sorted order. // Note that sorting by Java's String.compareTo, which is UTF16 sorted order, is not correct and can lead to // exceptions while building the FST: String inputValues[] = {"cat", "dog", "dogs"}; long outputValues[] = {5, 7, 12}; PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton(); FSTCompiler<Long> fstCompiler = new FSTCompiler<Long>(INPUT_TYPE.BYTE1, outputs); BytesRefBuilder scratchBytes = new BytesRefBuilder(); IntsRefBuilder scratchInts = new IntsRefBuilder(); for (int i = 0; i < inputValues.length; i++) { scratchBytes.copyChars(inputValues[i]); fstCompiler.add(Util.toIntsRef(scratchBytes.toBytesRef(), scratchInts), outputValues[i]); } FST<Long> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());Retrieval by key:
Long value = Util.get(fst, new BytesRef("dog")); System.out.println(value); // 7Retrieval by value:
// Only works because outputs are also in sorted order IntsRef key = Util.getByOutput(fst, 12); System.out.println(Util.toBytesRef(key, scratchBytes).utf8ToString()); // dogsIterate over key-value pairs in sorted order:
// Like TermsEnum, this also supports seeking (advance) BytesRefFSTEnum<Long> iterator = new BytesRefFSTEnum<Long>(fst); while (iterator.next() != null) { InputOutput<Long> mapEntry = iterator.current(); System.out.println(mapEntry.input.utf8ToString()); System.out.println(mapEntry.output); }N-shortest paths by weight:
Comparator<Long> comparator = new Comparator<Long>() { public int compare(Long left, Long right) { return left.compareTo(right); } }; Arc<Long> firstArc = fst.getFirstArc(new Arc<Long>()); Util.TopResults<Long> paths = Util.shortestPaths(fst, firstArc, fst.outputs.getNoOutput(), comparator, 3, true); System.out.println(Util.toBytesRef(paths.topN.get(0).input, scratchBytes).utf8ToString()); // cat System.out.println(paths.topN.get(0).output); // 5 System.out.println(Util.toBytesRef(paths.topN.get(1).input, scratchBytes).utf8ToString()); // dog System.out.println(paths.topN.get(1).output); // 7
-
Interface Summary Interface Description FSTReader Abstraction for reading bytes necessary for FST.FSTStore A type ofFSTReader
which needs data to be initialized before use -
Class Summary Class Description ByteSequenceOutputs An FSTOutputs
implementation where each output is a sequence of bytes.BytesRefFSTEnum<T> Enumerates all input (BytesRef) + output pairs in an FST.BytesRefFSTEnum.InputOutput<T> Holds a single input (BytesRef) + output pair.CharSequenceOutputs An FSTOutputs
implementation where each output is a sequence of characters.FST<T> Represents an finite state machine (FST), using a compact byte[] format.FST.Arc<T> Represents a single arc.FST.BytesReader Reads bytes stored in an FST.FST.FSTMetadata<T> Represents the FST metadata.FSTCompiler<T> Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs.FSTCompiler.Builder<T> Fluent-style constructor for FSTFSTCompiler
.IntSequenceOutputs An FSTOutputs
implementation where each output is a sequence of ints.IntsRefFSTEnum<T> Enumerates all input (IntsRef) + output pairs in an FST.IntsRefFSTEnum.InputOutput<T> Holds a single input (IntsRef) + output pair.NoOutputs A null FSTOutputs
implementation; use this if you just want to build an FSA.OffHeapFSTStore Provides off heap storage of finite state machine (FST), using underlying index input instead of byte store on heapOnHeapFSTStore Provides storage of finite state machine (FST), using byte array or byte store allocated on heap.Outputs<T> Represents the outputs for an FST, providing the basic algebra required for building and traversing the FST.PairOutputs<A,B> An FSTOutputs
implementation, holding two other outputs.PairOutputs.Pair<A,B> Holds a single pair of two outputs.PositiveIntOutputs An FSTOutputs
implementation where each output is a non-negative long value.Util Static helper methods.Util.FSTPath<T> Represents a path in TopNSearcher.Util.Result<T> Holds a single input (IntsRef) + output, returned byshortestPaths()
.Util.TopNSearcher<T> Utility class to find top N shortest paths from start point(s).Util.TopResults<T> Holds the results for a top N search usingUtil.TopNSearcher
-
Enum Summary Enum Description FST.INPUT_TYPE Specifies allowed range of each int input label for this FST.