Class Viterbi<T extends Token,U extends Viterbi.Position>

java.lang.Object
org.apache.lucene.analysis.morph.Viterbi<T,U>
Type Parameters:
T - output token class
U - position class
Direct Known Subclasses:
ViterbiNBest

public abstract class Viterbi<T extends Token,U extends Viterbi.Position> extends Object
Performs the Viterbi algorithm for morphological tokenizers, which segment text using a Hidden Markov Model or Conditional Random Fields.
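To make the idea concrete, here is a minimal sketch of a minimum-cost path search over a word lattice, which is the core of Viterbi segmentation. It is not Lucene's implementation: `ToyViterbi`, `Entry`, the class ids, and the convention that the start of the text behaves like class 0 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy lattice search; every name here is hypothetical and not part of Lucene. */
class ToyViterbi {
  /** Dictionary entry: surface form, word cost, and a coarse class id for connection costs. */
  static final class Entry {
    final String surface;
    final int wordCost;
    final int classId;

    Entry(String surface, int wordCost, int classId) {
      this.surface = surface;
      this.wordCost = wordCost;
      this.classId = classId;
    }
  }

  static final int INF = Integer.MAX_VALUE / 2;

  /** Returns the minimum-cost segmentation of text, or an empty list if none exists. */
  static List<String> segment(String text, List<Entry> dict, int[][] connCost) {
    int n = text.length();
    int numClasses = connCost.length;
    // best[i][c]: cheapest cost of a path ending at offset i whose last word has class c.
    int[][] best = new int[n + 1][numClasses];
    // backEntry[i][c] and backClass[i][c]: last word and previous class on that path.
    Entry[][] backEntry = new Entry[n + 1][numClasses];
    int[][] backClass = new int[n + 1][numClasses];
    for (int[] row : best) Arrays.fill(row, INF);
    best[0][0] = 0; // assumption: the start of the text behaves like class 0

    // Forward pass: extend every reachable (offset, class) state by each matching word.
    for (int i = 0; i < n; i++) {
      for (int prev = 0; prev < numClasses; prev++) {
        if (best[i][prev] >= INF) continue;
        for (Entry e : dict) {
          if (e.surface.isEmpty() || !text.startsWith(e.surface, i)) continue;
          int end = i + e.surface.length();
          int cost = best[i][prev] + connCost[prev][e.classId] + e.wordCost;
          if (cost < best[end][e.classId]) {
            best[end][e.classId] = cost;
            backEntry[end][e.classId] = e;
            backClass[end][e.classId] = prev;
          }
        }
      }
    }

    // Pick the cheapest final state, then follow back pointers to the start.
    int c = 0;
    for (int k = 1; k < numClasses; k++) if (best[n][k] < best[n][c]) c = k;
    if (best[n][c] >= INF) return Collections.emptyList();
    List<String> out = new ArrayList<>();
    int i = n;
    while (i > 0) {
      Entry e = backEntry[i][c];
      out.add(e.surface);
      c = backClass[i][c];
      i -= e.surface.length();
    }
    Collections.reverse(out); // the backward walk collects words last-to-first
    return out;
  }
}
```

The sketch processes the whole input at once. The class documented here instead works incrementally over a rolling buffer, so it only needs to keep the positions that are still ambiguous.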
  • Field Details

    • VERBOSE

      protected static final boolean VERBOSE
    • MAX_UNKNOWN_WORD_LENGTH

      protected static final int MAX_UNKNOWN_WORD_LENGTH
    • costs

      protected final ConnectionCosts costs
    • wordIdRef

      protected final IntsRef wordIdRef
    • buffer

      protected final RollingCharBuffer buffer
    • positions

      protected final Viterbi.WrappedPositionArray<U> positions
    • end

      protected boolean end
    • lastBackTracePos

      protected int lastBackTracePos
    • pos

      protected int pos
    • pending

      protected final List<T> pending
    • outputNBest

      protected boolean outputNBest
    • enableSpacePenaltyFactor

      protected boolean enableSpacePenaltyFactor
    • outputLongestUserEntryOnly

      protected boolean outputLongestUserEntryOnly
  • Constructor Details

  • Method Details

    • forward

      public final void forward() throws IOException
      Incrementally parses some more characters. This runs the Viterbi search forward just far enough to generate some more tokens. How far that is depends on the incoming characters, since some characters can cause longer-lasting ambiguity in the parse. Once the ambiguity is resolved, the method backtraces, produces the pending tokens, and returns.
      Throws:
      IOException
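A caller typically alternates between forward() and draining the pending list. The following sketch shows one plausible consumer loop. `ToyParser` is a hypothetical stand-in for a concrete subclass: only the method names forward(), isEnd(), and getPending() mirror this class, and the stub's "two words per call" behavior is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Viterbi subclass; not part of Lucene. */
class ToyParser {
  private final String[] words;
  private final List<String> pending = new ArrayList<>();
  private int next = 0;
  private boolean end = false;

  ToyParser(String text) {
    this.words = text.split(" ");
  }

  /** Pretends each call resolves up to two words and stores them in reverse order. */
  void forward() {
    int stop = Math.min(next + 2, words.length);
    for (int i = stop - 1; i >= next; i--) pending.add(words[i]);
    next = stop;
    if (next == words.length) end = true;
  }

  boolean isEnd() {
    return end;
  }

  List<String> getPending() {
    return pending;
  }
}

class PendingDrain {
  /** Returns the next token in text order, or null once the input is exhausted. */
  static String nextToken(ToyParser parser) {
    // Parse further only when nothing is queued and input remains.
    while (parser.getPending().isEmpty() && !parser.isEnd()) {
      parser.forward();
    }
    List<String> pending = parser.getPending();
    if (pending.isEmpty()) return null;
    // The list is reversed, so the next token in text order sits at the tail.
    return pending.remove(pending.size() - 1);
  }

  /** Collects all tokens in text order. */
  static List<String> drain(ToyParser parser) {
    List<String> out = new ArrayList<>();
    for (String t = nextToken(parser); t != null; t = nextToken(parser)) out.add(t);
    return out;
  }
}
```

Removing from the tail of an `ArrayList` is a constant-time operation, which is one likely reason to keep the pending list in reverse order.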
    • shouldSkipProcessUnknownWord

      protected boolean shouldSkipProcessUnknownWord(int unknownWordEndIndex, Viterbi.Position posData)
    • processUnknownWord

      protected abstract int processUnknownWord(boolean anyMatches, Viterbi.Position posData) throws IOException
      Add unknown words to the position graph.
      Returns:
      word length
      Throws:
      IOException
    • backtrace

      protected abstract void backtrace(Viterbi.Position endPosData, int fromIDX) throws IOException
      Backtraces from the provided position to the point where the last backtrace ended, accumulating the resulting tokens in the pending list. The pending list is then in reverse order (the last token in the list should be returned first).
      Throws:
      IOException
    • backtraceNBest

      protected void backtraceNBest(Viterbi.Position endPosData, boolean useEOS) throws IOException
      Backtrace the n-best path. Subclasses that support n-best paths should implement this method.
      Throws:
      IOException
    • fixupPendingList

      protected void fixupPendingList()
      Removes duplicated tokens from the pending list. This is needed because backtrace(Position, int) and backtraceNBest(Position, boolean) can add the same tokens to the list. Subclasses that support n-best paths should implement this method.
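As a rough illustration of what such a cleanup could look like, the sketch below drops tokens whose character span has already been seen. It assumes that "duplicated" means "same start and end offset", which this page does not state. `PendingDedup` and `Span` are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; treats tokens with identical offsets as duplicates. */
class PendingDedup {
  /** Minimal token stand-in carrying only its character span. */
  static final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Removes later occurrences of an already-seen span, keeping list order. */
  static void removeDuplicateSpans(List<Span> pending) {
    Set<Long> seen = new HashSet<>();
    // Pack both offsets into one long so the set needs no custom key class.
    pending.removeIf(s -> !seen.add(((long) s.start << 32) | (s.end & 0xFFFFFFFFL)));
  }
}
```

The first occurrence of each span is kept, so the reversed order of the pending list is preserved.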
    • add

      protected final void add(MorphData morphData, Viterbi.Position fromPosData, int wordPos, int endPos, int wordID, TokenType type, boolean addPenalty) throws IOException
      Add a token on the minimum cost path to the pending token list.
      Throws:
      IOException
    • computeSpacePenalty

      protected int computeSpacePenalty(MorphData morphData, int wordID, int numSpaces)
      Returns the space penalty.
    • computePenalty

      protected int computePenalty(int pos, int length) throws IOException
      Returns the penalty for a specific input region.
      Throws:
      IOException
    • getPos

      public int getPos()
    • isEnd

      public boolean isEnd()
    • getPending

      public List<T> getPending()
    • isOutputNBest

      public boolean isOutputNBest()
    • resetBuffer

      public void resetBuffer(Reader reader)
    • resetState

      public void resetState()