java.lang.Object

org.apache.lucene.analysis.morph.Viterbi<T,U>

Type Parameters: T - output token class; U - position class

Direct Known Subclasses: ViterbiNBest

public abstract class Viterbi<T extends Token,U extends Viterbi.Position> extends Object

Performs the Viterbi algorithm for morphological tokenizers, which segment text using a Hidden Markov Model or Conditional Random Fields.
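
As a rough illustration of the kind of search this class performs, the sketch below finds a minimum cost segmentation over a tiny word lattice. It is not Lucene code: `LatticeSketch`, `Word`, `Arrival`, the cost values, and the connection cost matrix are all hypothetical, and real dictionaries and cost tables are far larger. Each position keeps one back pointer per arriving word, and the cost of extending a path is the previous total plus a connection cost plus the word's own cost.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy lattice search; every name in this class is hypothetical, not Lucene API. */
final class LatticeSketch {
  /** A dictionary candidate covering input chars [start, end). */
  static final class Word {
    final String surface;
    final int start, end, leftId, rightId, wordCost;

    Word(String surface, int start, int leftId, int rightId, int wordCost) {
      this.surface = surface;
      this.start = start;
      this.end = start + surface.length();
      this.leftId = leftId;
      this.rightId = rightId;
      this.wordCost = wordCost;
    }
  }

  /** One back pointer: the word that arrived, its best total cost, and its predecessor. */
  static final class Arrival {
    final Word word; // null marks the beginning of the input
    final int totalCost;
    final Arrival prev;

    Arrival(Word word, int totalCost, Arrival prev) {
      this.word = word;
      this.totalCost = totalCost;
      this.prev = prev;
    }
  }

  static List<String> segment(int length, List<Word> words, int[][] connectionCost) {
    List<List<Arrival>> positions = new ArrayList<>();
    for (int i = 0; i <= length; i++) {
      positions.add(new ArrayList<>());
    }
    positions.get(0).add(new Arrival(null, 0, null)); // start node, treated as rightId 0

    // Forward pass: words only point forward, so arrivals at pos are complete when we reach it.
    for (int pos = 0; pos < length; pos++) {
      if (positions.get(pos).isEmpty()) {
        continue; // no path reaches this position
      }
      for (Word w : words) {
        if (w.start != pos) {
          continue;
        }
        Arrival bestPrev = null;
        int bestCost = Integer.MAX_VALUE;
        for (Arrival a : positions.get(pos)) {
          int prevRightId = (a.word == null) ? 0 : a.word.rightId;
          int cost = a.totalCost + connectionCost[prevRightId][w.leftId] + w.wordCost;
          if (cost < bestCost) {
            bestCost = cost;
            bestPrev = a;
          }
        }
        positions.get(w.end).add(new Arrival(w, bestCost, bestPrev));
      }
    }

    // Pick the cheapest arrival at the end of the input, then follow back pointers.
    Arrival best = null;
    for (Arrival a : positions.get(length)) {
      if (best == null || a.totalCost < best.totalCost) {
        best = a;
      }
    }
    if (best == null) {
      throw new IllegalStateException("no path covers the whole input");
    }
    List<String> out = new ArrayList<>();
    for (Arrival a = best; a.word != null; a = a.prev) {
      out.add(0, a.word.surface); // prepend to restore reading order
    }
    return out;
  }

  /** Segments the three-character input "abc" with made-up costs. */
  static List<String> demo() {
    List<Word> words = List.of(
        new Word("a", 0, 0, 0, 3),
        new Word("ab", 0, 0, 1, 4),
        new Word("b", 1, 0, 0, 3),
        new Word("bc", 1, 0, 0, 4),
        new Word("c", 2, 1, 0, 2));
    int[][] connectionCost = {{1, 4}, {4, 0}};
    return segment(3, words, connectionCost);
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [ab, c]
  }
}
```

With these made-up costs the path "ab" + "c" totals 7, which beats "a" + "bc" at 9 and "a" + "b" + "c" at 14, because the connection from "ab" to "c" is free in the toy matrix.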

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

Viterbi.Position

Holds all back pointers arriving at this position.

static final class

Viterbi.WrappedPositionArray

Holds a partial graph (array of positions) for calculating the minimum cost path.

Field Summary

Fields

Modifier and Type

Field

Description

protected final RollingCharBuffer

buffer

protected final ConnectionCosts

costs

protected boolean

enableSpacePenaltyFactor

protected boolean

end

protected int

lastBackTracePos

protected static final int

MAX_UNKNOWN_WORD_LENGTH

protected boolean

outputLongestUserEntryOnly

protected boolean

outputNBest

protected final List<T>

pending

protected int

pos

protected final Viterbi.WrappedPositionArray

positions

protected static final boolean

VERBOSE

protected final IntsRef

wordIdRef
Constructor Summary

Constructors

Modifier

Constructor

Description

protected

Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class positionImpl)
Method Summary

Modifier and Type

Method

Description

protected final void

add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty)

Add a token on the minimum cost path to the pending token list.

protected abstract void

backtrace(Viterbi.Position endPosData, int fromIDX)

Backtrace from the provided position, back to the last time we back-traced, accumulating the resulting tokens to the pending list.

protected void

backtraceNBest(Viterbi.Position endPosData, boolean useEOS)

Backtrace the n-best path.

protected int

computePenalty(int pos, int length)

Returns the penalty for a specific input region.

protected int

computeSpacePenalty(MorphData morphData, int wordID, int numSpaces)

Returns the space penalty.

protected void

fixupPendingList()

Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list.

final void

forward()

Incrementally parse some more characters.

List<T>

getPending()

int

getPos()

boolean

isEnd()

boolean

isOutputNBest()

protected abstract int

processUnknownWord(boolean anyMatches, Viterbi.Position posData)

Add unknown words to the position graph.

void

resetBuffer(Reader reader)

void

resetState()

protected boolean

shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- VERBOSE
 
 protected static final boolean VERBOSE
 See Also:
 
 Constant Field Values
- MAX_UNKNOWN_WORD_LENGTH
 
 protected static final int MAX_UNKNOWN_WORD_LENGTH
 See Also:
 
 Constant Field Values
- costs
 
 protected final ConnectionCosts costs
- wordIdRef
 
 protected final IntsRef wordIdRef
- buffer
 
 protected final RollingCharBuffer buffer
- positions
 
 protected final Viterbi.WrappedPositionArray positions
- end
 
 protected boolean end
- lastBackTracePos
 
 protected int lastBackTracePos
- pos
 
 protected int pos
- pending
 
 protected final List<T> pending
- outputNBest
 
 protected boolean outputNBest
- enableSpacePenaltyFactor
 
 protected boolean enableSpacePenaltyFactor
- outputLongestUserEntryOnly
 
 protected boolean outputLongestUserEntryOnly
Constructor Details
- Viterbi
 
 protected Viterbi(TokenInfoFST fst, FST.BytesReader fstReader, BinaryDictionary<? extends MorphData> dictionary, TokenInfoFST userFST, FST.BytesReader userFSTReader, Dictionary<? extends MorphData> userDictionary, ConnectionCosts costs, Class positionImpl)
Method Details
- forward
 
 public final void forward() throws IOException
 
 Incrementally parse some more characters. This runs the Viterbi search forward just far enough to generate some more tokens. How far forward depends on the incoming characters, since some characters can cause longer-lasting ambiguity in the parse. Once the ambiguity is resolved, this method backtraces, adds the resulting tokens to the pending list, and returns.
 
 Throws:
 
 IOException
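
 The following is a minimal sketch of how a caller might drive forward() together with isEnd() and getPending(). It does not use Lucene itself: `IncrementalParser` is a hypothetical stand-in that only borrows those three method names, and `ToyParser` is a made-up implementation that releases at most two pre-split words per forward() call. The loop order (check the pending list, then isEnd(), then call forward()) is an assumption about typical usage, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Driver-loop sketch; all types here are hypothetical stand-ins, not Lucene API. */
final class ForwardLoopSketch {
  /** Mirrors only the names forward(), isEnd() and getPending() from the page above. */
  interface IncrementalParser {
    void forward();

    boolean isEnd();

    List<String> getPending();
  }

  /** Toy parser: each forward() call releases up to two words, stored in reverse order. */
  static final class ToyParser implements IncrementalParser {
    private final String[] words;
    private final List<String> pending = new ArrayList<>();
    private int next;
    private boolean end;

    ToyParser(String... words) {
      this.words = words;
    }

    @Override
    public void forward() {
      int stop = Math.min(next + 2, words.length);
      for (int i = stop - 1; i >= next; i--) {
        pending.add(words[i]); // last word first, so the tail holds the earliest word
      }
      next = stop;
      if (next == words.length) {
        end = true;
      }
    }

    @Override
    public boolean isEnd() {
      return end;
    }

    @Override
    public List<String> getPending() {
      return pending;
    }
  }

  /** Returns the next token, or null once the input is exhausted. */
  static String nextToken(IncrementalParser parser) {
    while (parser.getPending().isEmpty()) {
      if (parser.isEnd()) {
        return null;
      }
      parser.forward(); // parse just enough to produce more tokens
    }
    List<String> pending = parser.getPending();
    return pending.remove(pending.size() - 1); // consume from the tail
  }

  static List<String> drain(IncrementalParser parser) {
    List<String> out = new ArrayList<>();
    for (String t = nextToken(parser); t != null; t = nextToken(parser)) {
      out.add(t);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(drain(new ToyParser("a", "b", "c"))); // prints [a, b, c]
  }
}
```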
- shouldSkipProcessUnknownWord
 
 protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData)
- processUnknownWord
 
 protected abstract int processUnknownWord(boolean anyMatches, Viterbi.Position posData) throws IOException
 
 Add unknown words to the position graph.
 
 Returns:
 
 word length
 
 Throws:
 
 IOException
- backtrace
 
 protected abstract void backtrace(Viterbi.Position endPosData, int fromIDX) throws IOException
 
 Backtrace from the provided position, back to the last time we back-traced, accumulating the resulting tokens to the pending list. The pending list is then in reverse order (the last token in the list should be returned first).
 
 Throws:
 
 IOException
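
 To illustrate why the pending list ends up reversed, here is a small sketch under simplified assumptions. `Node` is a hypothetical back-pointer record, not Viterbi.Position, and tokens are plain strings. Walking back pointers from the end position visits the latest token first, so appending as you go yields a list whose tail holds the earliest token.

```java
import java.util.ArrayList;
import java.util.List;

/** Back-pointer walk sketch; Node is a hypothetical type, not Lucene API. */
final class BacktraceSketch {
  /** The token that ends at this node, plus the node the best path came from. */
  static final class Node {
    final String token;
    final Node prev;

    Node(String token, Node prev) {
      this.token = token;
      this.prev = prev;
    }
  }

  /** Follows back pointers from the end node and appends each token as it is visited. */
  static List<String> backtrace(Node end) {
    List<String> pending = new ArrayList<>();
    for (Node n = end; n != null; n = n.prev) {
      pending.add(n.token);
    }
    return pending; // latest token at index 0, earliest token at the tail
  }

  public static void main(String[] args) {
    Node first = new Node("ab", null);
    Node second = new Node("c", first);
    List<String> pending = backtrace(second);
    System.out.println(pending); // prints [c, ab]
    System.out.println(pending.remove(pending.size() - 1)); // prints ab
  }
}
```

 Removing from the tail of an array-backed list is cheap, which is one plausible reason to leave the list reversed rather than re-sort it.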
- backtraceNBest
 
 protected void backtraceNBest(Viterbi.Position endPosData, boolean useEOS) throws IOException
 
 Backtrace the n-best path. Subclasses that support n-best paths should implement this method.
 
 Throws:
 
 IOException
- fixupPendingList
 
 protected void fixupPendingList()
 
 Remove duplicated tokens from the pending list; this is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list. Subclasses that support n-best paths should implement this method.
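
 One way such a cleanup could look is sketched below. This is an assumption-heavy illustration, not the behavior of any Lucene subclass: `Tok` is a hypothetical token type, and treating two tokens as duplicates when they share the same offset and length is a guess at a reasonable definition. The list is sorted so later offsets come first, matching a pending list that is consumed from the tail, and adjacent duplicates are then dropped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

/** Duplicate-removal sketch; Tok and the duplicate rule are hypothetical. */
final class FixupSketch {
  static final class Tok {
    final int offset;
    final int length;
    final String surface;

    Tok(int offset, int length, String surface) {
      this.offset = offset;
      this.length = length;
      this.surface = surface;
    }
  }

  /** Sorts by offset then length, both descending, and removes adjacent duplicates. */
  static void fixup(List<Tok> pending) {
    pending.sort(
        Comparator.comparingInt((Tok t) -> t.offset)
            .thenComparingInt(t -> t.length)
            .reversed());
    Tok prev = null;
    for (Iterator<Tok> it = pending.iterator(); it.hasNext(); ) {
      Tok t = it.next();
      if (prev != null && prev.offset == t.offset && prev.length == t.length) {
        it.remove(); // same span as the previous token: treat as a duplicate
      } else {
        prev = t;
      }
    }
  }

  /** Builds a list with one duplicated span and returns the surfaces left after fixup. */
  static List<String> demo() {
    List<Tok> pending = new ArrayList<>();
    pending.add(new Tok(0, 2, "ab"));
    pending.add(new Tok(2, 1, "c"));
    pending.add(new Tok(0, 1, "a"));
    pending.add(new Tok(0, 2, "ab"));
    fixup(pending);
    List<String> surfaces = new ArrayList<>();
    for (Tok t : pending) {
      surfaces.add(t.surface);
    }
    return surfaces;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [c, ab, a]
  }
}
```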
- add
 
 protected final void add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty) throws IOException
 
 Add a token on the minimum cost path to the pending token list.
 
 Throws:
 
 IOException
- computeSpacePenalty
 
 protected int computeSpacePenalty(MorphData morphData, int wordID, int numSpaces)
 
 Returns the space penalty.
- computePenalty
 
 protected int computePenalty(int pos, int length) throws IOException
 
 Returns the penalty for a specific input region.
 
 Throws:
 
 IOException
- getPos
 
 public int getPos()
- isEnd
 
 public boolean isEnd()
- getPending
 
 public List<T> getPending()
- isOutputNBest
 
 public boolean isOutputNBest()
- resetBuffer
 
 public void resetBuffer(Reader reader)
- resetState
 
 public void resetState()
