Class TokenStreamToAutomaton

java.lang.Object
org.apache.lucene.analysis.TokenStreamToAutomaton

public class TokenStreamToAutomaton extends Object
Consumes a TokenStream and creates an Automaton whose transition labels are UTF-8 bytes (or Unicode code points if unicodeArcs is true) taken from the TermToBytesRefAttribute. POS_SEP is inserted between tokens, and HOLE is inserted for holes (missing positions).
WARNING: This API is experimental and might change in incompatible ways in the next release.
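The label layout described above can be modelled without Lucene. The following is a minimal sketch, not the library's implementation: the `Token` class, the `labels` method and the symbolic `<POS_SEP>` / `<HOLE>` strings are hypothetical stand-ins, each term is kept as one label for readability (the real automaton has one arc per byte or code point), and it assumes that every skipped position contributes one HOLE label set off by POS_SEP. Stacked tokens (position increment 0) and position lengths are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenLabelSketch {
    // Symbolic stand-ins for the int constants POS_SEP and HOLE (hypothetical representation).
    static final String POS_SEP = "<POS_SEP>";
    static final String HOLE = "<HOLE>";

    /** Hypothetical token model: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Flattens a linear token sequence into labels. Assumption: a position
     * increment of n > 1 means n - 1 skipped positions, each emitted as HOLE.
     */
    static List<String> labels(List<Token> tokens, boolean preservePositionIncrements) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            int holes = preservePositionIncrements ? t.positionIncrement - 1 : 0;
            for (int i = 0; i < holes; i++) {
                if (!out.isEmpty()) {
                    out.add(POS_SEP);
                }
                out.add(HOLE);
            }
            if (!out.isEmpty()) {
                out.add(POS_SEP);
            }
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // "fox" has position increment 2, e.g. because a stop word before it was removed.
        List<Token> tokens = Arrays.asList(new Token("quick", 1), new Token("fox", 2));
        System.out.println(labels(tokens, true));
        System.out.println(labels(tokens, false));
    }
}
```

With position increments preserved the gap shows up as a HOLE label; with them ignored the two terms are joined by a single POS_SEP.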
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    HOLE
    We add this arc to represent a hole.
    static final int
    POS_SEP
    We add this transition between two adjacent tokens.
  • Constructor Summary

    Constructors
    Constructor
    Description
    TokenStreamToAutomaton()
    Sole constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected BytesRef
    changeToken(BytesRef in)
    Override this in a subclass if you need to change the token (such as escaping certain bytes) before it is turned into a graph.
    void
    setFinalOffsetGapAsHole(boolean finalOffsetGapAsHole)
    If true, any final offset gaps will result in adding a position hole.
    void
    setPreservePositionIncrements(boolean enablePositionIncrements)
    Whether to generate holes in the automaton for missing positions, true by default.
    void
    setUnicodeArcs(boolean unicodeArcs)
    Whether to make transition labels Unicode code points instead of UTF-8 bytes, false by default.
    Automaton
    toAutomaton(TokenStream in)
    Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • POS_SEP

      public static final int POS_SEP
      We add this transition between two adjacent tokens.
    • HOLE

      public static final int HOLE
      We add this arc to represent a hole.
  • Constructor Details

    • TokenStreamToAutomaton

      public TokenStreamToAutomaton()
      Sole constructor.
  • Method Details

    • setPreservePositionIncrements

      public void setPreservePositionIncrements(boolean enablePositionIncrements)
      Whether to generate holes in the automaton for missing positions, true by default.
    • setFinalOffsetGapAsHole

      public void setFinalOffsetGapAsHole(boolean finalOffsetGapAsHole)
      If true, any final offset gaps will result in adding a position hole.
    • setUnicodeArcs

      public void setUnicodeArcs(boolean unicodeArcs)
      Whether to make transition labels Unicode code points instead of UTF-8 bytes, false by default.
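To illustrate the difference between the two labelling modes, the sketch below computes both label sequences for one term using only the JDK. It does not call Lucene; the class and method names are hypothetical, and it assumes byte labels are taken as unsigned values in the range 0 to 255.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ArcLabelSketch {
    /** One label per UTF-8 byte, widened to an unsigned int (assumed label form). */
    static int[] utf8Labels(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF;
        }
        return labels;
    }

    /** One label per Unicode code point. */
    static int[] codePointLabels(String term) {
        return term.codePoints().toArray();
    }

    public static void main(String[] args) {
        String term = "caf\u00e9"; // "café": the last character needs two UTF-8 bytes
        System.out.println(Arrays.toString(utf8Labels(term)));      // [99, 97, 102, 195, 169]
        System.out.println(Arrays.toString(codePointLabels(term))); // [99, 97, 102, 233]
    }
}
```

A non-ASCII character therefore spans several arcs in byte mode but a single arc in code point mode.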
    • changeToken

      protected BytesRef changeToken(BytesRef in)
      Override this in a subclass if you need to change the token (such as escaping certain bytes) before it is turned into a graph.
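As an illustration of the kind of byte escaping an override might perform, here is a minimal sketch that works on a plain `byte[]` rather than a BytesRef so that it runs without Lucene. The `RESERVED` and `ESCAPE` values, the escaping scheme and the method name are all hypothetical choices made for this example; an actual override would apply similar logic to the bytes of the incoming BytesRef and return a new one.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class EscapeTokenSketch {
    // Hypothetical byte that must not appear unescaped inside token text.
    static final byte RESERVED = 0x1F;
    // Hypothetical escape marker; 0xFF never occurs in well-formed UTF-8.
    static final byte ESCAPE = (byte) 0xFF;

    /** Returns a copy of the input with ESCAPE written before each RESERVED or ESCAPE byte. */
    static byte[] escapeReserved(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length + 4);
        for (byte b : in) {
            if (b == RESERVED || b == ESCAPE) {
                out.write(ESCAPE & 0xFF);
            }
            out.write(b & 0xFF);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] token = {'a', RESERVED, 'b'};
        System.out.println(Arrays.toString(escapeReserved(token)));
    }
}
```

Escaping the escape marker itself keeps the transformation reversible.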
    • toAutomaton

      public Automaton toAutomaton(TokenStream in) throws IOException
      Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
      Throws:
      IOException