Package org.apache.lucene.analysis.morph
Class Viterbi<T extends Token,U extends Viterbi.Position>
java.lang.Object
org.apache.lucene.analysis.morph.Viterbi<T,U>
- Type Parameters:
T - output token class
U - position class
- Direct Known Subclasses:
ViterbiNBest
Performs the Viterbi algorithm for morphological tokenizers, which split text using a Hidden Markov Model or Conditional Random Fields.
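As a rough illustration of the idea, the following self-contained sketch finds a minimum cost segmentation of a string with a forward pass and a backtrace. It is not the implementation in this class: the class name ToyViterbi, the lexicon map, and the single flat connection cost are all assumptions made for the example (the real class reads word and connection costs from dictionaries and processes input incrementally).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy minimum cost segmentation; hypothetical, not the Lucene implementation. */
final class ToyViterbi {
  /** Hypothetical lexicon: surface form -> word cost. */
  private final Map<String, Integer> wordCosts;

  /** Assumed flat cost for every transition between two adjacent tokens. */
  private final int connectionCost;

  ToyViterbi(Map<String, Integer> wordCosts, int connectionCost) {
    this.wordCosts = wordCosts;
    this.connectionCost = connectionCost;
  }

  /** Returns the cheapest segmentation, or an empty list if no path covers the text. */
  List<String> segment(String text) {
    int n = text.length();
    int[] best = new int[n + 1]; // best[i]: minimum cost of any path ending at offset i
    int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path
    Arrays.fill(best, Integer.MAX_VALUE);
    Arrays.fill(back, -1);
    best[0] = 0;

    // Forward pass: extend every reachable offset with every lexicon word that starts there.
    for (int start = 0; start < n; start++) {
      if (best[start] == Integer.MAX_VALUE) {
        continue; // no path reaches this offset
      }
      for (int end = start + 1; end <= n; end++) {
        Integer wordCost = wordCosts.get(text.substring(start, end));
        if (wordCost == null) {
          continue;
        }
        int cost = best[start] + wordCost + (start == 0 ? 0 : connectionCost);
        if (cost < best[end]) {
          best[end] = cost;
          back[end] = start;
        }
      }
    }
    if (best[n] == Integer.MAX_VALUE) {
      return new ArrayList<>(); // a real tokenizer would add unknown words here
    }

    // Backtrace: follow back pointers from the end, which yields tokens in reverse order.
    List<String> tokens = new ArrayList<>();
    for (int end = n; end > 0; end = back[end]) {
      tokens.add(text.substring(back[end], end));
    }
    Collections.reverse(tokens);
    return tokens;
  }
}
```

With the lexicon {a=2, b=2, ab=5, c=1, abc=20} and a connection cost of 3, segmenting "abc" picks "ab" + "c" (cost 9) over "a" + "b" + "c" (cost 11) and "abc" (cost 20).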
-
Nested Class Summary
Nested Classes
Modifier and Type  Class  Description
static class  Viterbi.Position
    Holds all back pointers arriving at this position.
static final class  Viterbi.WrappedPositionArray<U extends Viterbi.Position>
    Holds a partial graph (array of positions) for calculating the minimum cost path.
-
Field Summary
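Fields
Modifier and Type  Field  Description
protected final RollingCharBuffer  buffer
protected final ConnectionCosts  costs
protected boolean  enableSpacePenaltyFactor
protected boolean  end
protected int  lastBackTracePos
protected static final int  MAX_UNKNOWN_WORD_LENGTH
protected boolean  outputLongestUserEntryOnly
protected boolean  outputNBest
protected final List<T>  pending
protected int  pos
protected final Viterbi.WrappedPositionArray<U>  positions
protected static final boolean  VERBOSE
protected final IntsRef  wordIdRef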
-
Constructor Summary
Constructors
Modifier  Constructor  Description
protected  Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class<U> positionImpl)
-
Method Summary
Modifier and Type  Method  Description
protected final void  add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty)
    Add a token on the minimum cost path to the pending token list.
protected abstract void  backtrace(Viterbi.Position endPosData, int fromIDX)
    Backtrace from the provided position, back to the last time we back-traced, accumulating the resulting tokens to the pending list.
protected void  backtraceNBest(Viterbi.Position endPosData, boolean useEOS)
    Backtrace the n-best path.
protected int  computePenalty(int pos, int length)
    Returns the penalty for a specific input region.
protected int  computeSpacePenalty(MorphData morphData, int wordID, int numSpaces)
    Returns the space penalty.
protected void  fixupPendingList()
    Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list.
final void  forward()
    Incrementally parse some more characters.
List<T>  getPending()
int  getPos()
boolean  isEnd()
boolean  isOutputNBest()
protected abstract int  processUnknownWord(boolean anyMatches, Viterbi.Position posData)
    Add unknown words to the position graph.
void  resetBuffer(Reader reader)
void  resetState()
protected boolean  shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData)
-
Field Details
-
VERBOSE
protected static final boolean VERBOSE
-
MAX_UNKNOWN_WORD_LENGTH
protected static final int MAX_UNKNOWN_WORD_LENGTH
-
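costs
protected final ConnectionCosts costs
-
wordIdRef
protected final IntsRef wordIdRef
-
buffer
protected final RollingCharBuffer buffer
-
positions
protected final Viterbi.WrappedPositionArray<U> positions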
-
end
protected boolean end
-
lastBackTracePos
protected int lastBackTracePos
-
pos
protected int pos
-
pending
protected final List<T> pending
-
outputNBest
protected boolean outputNBest
-
enableSpacePenaltyFactor
protected boolean enableSpacePenaltyFactor
-
outputLongestUserEntryOnly
protected boolean outputLongestUserEntryOnly
-
-
Constructor Details
-
Viterbi
protected Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class<U> positionImpl)
-
-
Method Details
-
forward
public final void forward() throws IOException
Incrementally parse some more characters. This runs the Viterbi search forward "enough" to generate some more tokens. How far forward it runs depends on the incoming characters, since some characters can cause longer-lasting ambiguity in the parsing. Once the ambiguity is resolved, we backtrace, produce the pending tokens, and return.
- Throws:
IOException
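A caller typically alternates between forward() and the pending list. The sketch below shows one plausible driver loop under stated assumptions: IncrementalParser is a hypothetical stand-in interface that mirrors only forward(), isEnd(), and getPending() (without the IOException), tokens are plain strings, and ToyParser is a toy that emits pre-split batches. Because the pending list is held in reverse, the next token is taken from the end of the list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the parser methods a tokenizer calls. */
interface IncrementalParser {
  void forward();            // parse more input, possibly adding tokens to the pending list
  boolean isEnd();           // true once the input is exhausted
  List<String> getPending(); // reverse order: the last element is the next token to return
}

final class PendingDrainer {
  /** Returns the next token, or null when the input is exhausted and nothing is pending. */
  static String next(IncrementalParser parser) {
    List<String> pending = parser.getPending();
    while (pending.isEmpty()) {
      if (parser.isEnd()) {
        return null;
      }
      parser.forward();
    }
    return pending.remove(pending.size() - 1);
  }
}

/** Toy parser: each forward() call "resolves" one pre-split batch of tokens. */
final class ToyParser implements IncrementalParser {
  private final List<List<String>> batches;
  private final List<String> pending = new ArrayList<>();
  private int nextBatch = 0;

  ToyParser(List<List<String>> batches) {
    this.batches = batches;
  }

  @Override
  public void forward() {
    List<String> batch = batches.get(nextBatch++);
    // Mimic a backtrace: tokens are appended from last to first.
    for (int i = batch.size() - 1; i >= 0; i--) {
      pending.add(batch.get(i));
    }
  }

  @Override
  public boolean isEnd() {
    return nextBatch >= batches.size();
  }

  @Override
  public List<String> getPending() {
    return pending;
  }
}
```

Checking the pending list before isEnd() matters in this sketch: the final forward() call can both reach the end of input and leave tokens that still have to be returned.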
-
shouldSkipProcessUnknownWord
protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData)
-
processUnknownWord
protected abstract int processUnknownWord(boolean anyMatches, Viterbi.Position posData) throws IOException
Add unknown words to the position graph.
- Returns:
word length
- Throws:
IOException
-
backtrace
protected abstract void backtrace(Viterbi.Position endPosData, int fromIDX) throws IOException
Backtrace from the provided position, back to the last time we back-traced, accumulating the resulting tokens to the pending list. The pending list is then in reverse order (the last token should be returned first).
- Throws:
IOException
-
backtraceNBest
protected void backtraceNBest(Viterbi.Position endPosData, boolean useEOS) throws IOException
Backtrace the n-best path. Subclasses that support n-best paths should implement this method.
- Throws:
IOException
-
fixupPendingList
protected void fixupPendingList()
Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list. Subclasses that support n-best paths should implement this method.
-
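The page does not say how a subclass decides that two tokens are duplicates. The sketch below assumes, purely for illustration, that two tokens are duplicates when they cover the same start and end offsets; Span and PendingDedup are hypothetical names, and the first occurrence is kept so that the list order is preserved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical token stand-in that carries only its offsets. */
final class Span {
  final int start;
  final int end;

  Span(int start, int end) {
    this.start = start;
    this.end = end;
  }
}

final class PendingDedup {
  /** Removes later entries that repeat the (start, end) pair of an earlier entry, in place. */
  static void removeDuplicates(List<Span> pending) {
    Set<Long> seen = new HashSet<>();
    List<Span> kept = new ArrayList<>();
    for (Span span : pending) {
      // Pack both offsets into one long so the pair can be used as a set key.
      long key = ((long) span.start << 32) | (span.end & 0xFFFFFFFFL);
      if (seen.add(key)) {
        kept.add(span);
      }
    }
    pending.clear();
    pending.addAll(kept);
  }
}
```

A real subclass may compare more than offsets (for example token type or dictionary entry), so the equality rule here should be treated as a placeholder.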
add
protected final void add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty) throws IOException
Add a token on the minimum cost path to the pending token list.
- Throws:
IOException
-
computeSpacePenalty
protected int computeSpacePenalty(MorphData morphData, int wordID, int numSpaces)
Returns the space penalty.
-
computePenalty
protected int computePenalty(int pos, int length) throws IOException
Returns the penalty for a specific input region.
- Throws:
IOException
-
getPos
public int getPos()
-
isEnd
public boolean isEnd()
-
getPending
public List<T> getPending()
-
isOutputNBest
public boolean isOutputNBest()
-
resetBuffer
public void resetBuffer(Reader reader)
-
resetState
public void resetState()
-