Class RegExp

java.lang.Object
org.apache.lucene.util.automaton.RegExp

public class RegExp extends Object
Regular Expression extension to Automaton.

Regular expressions are built from the following abstract syntax:

description of regular expression grammar
regexp       ::= unionexp
               |
unionexp     ::= interexp | unionexp                         (union)
               | interexp
interexp     ::= concatexp & interexp                        (intersection) [OPTIONAL]
               | concatexp
concatexp    ::= repeatexp concatexp                         (concatenation)
               | repeatexp
repeatexp    ::= repeatexp ?                                 (zero or one occurrence)
               | repeatexp *                                 (zero or more occurrences)
               | repeatexp +                                 (one or more occurrences)
               | repeatexp {n}                               (n occurrences)
               | repeatexp {n,}                              (n or more occurrences)
               | repeatexp {n,m}                             (n to m occurrences, including both)
               | complexp
charclassexp ::= [ charclasses ]                             (character class)
               | [^ charclasses ]                            (negated character class)
               | simpleexp
charclasses  ::= charclass charclasses
               | charclass
charclass    ::= charexp - charexp                           (character range, including end-points)
               | charexp
simpleexp    ::= charexp
               | .                                           (any single character)
               | #                                           (the empty language) [OPTIONAL]
               | @                                           (any string) [OPTIONAL]
               | " <Unicode string without double-quotes> "  (a string)
               | ( )                                         (the empty string)
               | ( unionexp )                                (precedence override)
               | < <identifier> >                            (named automaton) [OPTIONAL]
               | <n-m>                                       (numerical interval) [OPTIONAL]
charexp      ::= <Unicode character>                         (a single non-reserved character)
               | \d                                          (a digit [0-9])
               | \D                                          (a non-digit [^0-9])
               | \s                                          (whitespace [ \t\n\r])
               | \S                                          (non whitespace [^\s])
               | \w                                          (a word character [a-zA-Z_0-9])
               | \W                                          (a non word character [^\w])
               | \ <Unicode character>                       (a single character)

The productions marked [OPTIONAL] are only allowed if specified by the syntax flags passed to the RegExp constructor. The reserved characters used in the (enabled) syntax must be escaped with a backslash (\) or double quotes ("..."). (In contrast to other regexp syntaxes, this is also required in character classes.) Be aware that the dash (-) has a special meaning in charclass expressions. An identifier is a string that contains neither a right angle bracket (>) nor a dash (-). Numerical intervals are specified by non-negative decimal integers and include both end points; if n and m have the same number of digits, the conforming strings must have that length (i.e. be prefixed by 0's).
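
The length rule for numerical intervals can be sketched in plain Java. This is a minimal illustration that does not call Lucene; the class `IntervalSketch` and the method `conforms` are hypothetical names, and the handling of bounds with differing digit counts (leading zeros left unrestricted) is an assumption rather than something stated above.

```java
import java.math.BigInteger;

class IntervalSketch {
    /**
     * Returns true if s conforms to the numerical interval <n-m>, where n and m
     * are the bounds exactly as written in the expression (leading zeros kept).
     */
    static boolean conforms(String s, String n, String m) {
        // Only non-empty strings of decimal digits can conform.
        if (s.isEmpty() || !s.chars().allMatch(ch -> ch >= '0' && ch <= '9')) {
            return false;
        }
        // Rule from the text: equal digit counts fix the length of conforming strings.
        if (n.length() == m.length() && s.length() != n.length()) {
            return false;
        }
        // Assumption: with differing digit counts, leading zeros are not restricted.
        BigInteger value = new BigInteger(s);
        return value.compareTo(new BigInteger(n)) >= 0
            && value.compareTo(new BigInteger(m)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(conforms("007", "001", "100")); // true
        System.out.println(conforms("7", "001", "100"));   // false: wrong length
        System.out.println(conforms("7", "1", "100"));     // true
    }
}
```

Under this reading, `<001-100>` accepts `007` but rejects `7`, because both bounds are written with three digits.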

WARNING: This API is experimental and might change in incompatible ways in the next release.
  • Nested Class Summary

    Nested Classes
    Modifier and Type    Class          Description
    static enum          RegExp.Kind    The type of expression represented by a RegExp node.
  • Field Summary

    Fields
    Modifier and Type    Field                     Description
    static final int     ALL                       Syntax flag, enables all optional regexp syntax.
    static final int     ANYSTRING                 Syntax flag, enables anystring (@).
    static final int     ASCII_CASE_INSENSITIVE    Deprecated.
    static final int     AUTOMATON                 Syntax flag, enables named automata (<identifier>).
    final int            c                         Character expression
    static final int     CASE_INSENSITIVE          Allows case-insensitive matching of most Unicode characters.
    static final int     CASE_INSENSITIVE_RANGE    Similar to CASE_INSENSITIVE but for character class ranges.
    static final int     DEPRECATED_COMPLEMENT     Deprecated. This flag will be removed in Lucene 11.
    final int            digits                    Limits for repeatable type expressions
    static final int     EMPTY                     Syntax flag, enables empty language (#).
    final RegExp         exp1                      Child expressions held by a container type expression
    final RegExp         exp2                      Child expressions held by a container type expression
    final int[]          from                      Extents for range type expressions
    static final int     INTERSECTION              Syntax flag, enables intersection (&).
    static final int     INTERVAL                  Syntax flag, enables numerical intervals (<n-m>).
    final RegExp.Kind    kind                      The type of expression
    final int            max                       Limits for repeatable type expressions
    final int            min                       Limits for repeatable type expressions
    static final int     NONE                      Syntax flag, enables no optional regexp syntax.
    final String         s                         String expression
    final int[]          to                        Extents for range type expressions
  • Constructor Summary

    Constructors
    Constructor                                            Description
    RegExp(String s)                                       Constructs new RegExp from a string.
    RegExp(String s, int syntax_flags)                     Constructs new RegExp from a string.
    RegExp(String s, int syntax_flags, int match_flags)    Constructs new RegExp from a string.
  • Method Summary

    Modifier and Type
    Method
    Description
    Returns set of automaton identifiers that occur in this regular expression.
    The string that was used to construct the regex.
    Constructs new Automaton from this RegExp.
    Constructs new Automaton from this RegExp.
    toAutomaton(AutomatonProvider automaton_provider)
    Constructs new Automaton from this RegExp.
    Constructs string from parsed regular expression.
    Like to string, but more verbose (shows the hierarchy more clearly).

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • INTERSECTION

      public static final int INTERSECTION
      Syntax flag, enables intersection (&).
    • EMPTY

      public static final int EMPTY
      Syntax flag, enables empty language (#).
    • ANYSTRING

      public static final int ANYSTRING
      Syntax flag, enables anystring (@).
    • AUTOMATON

      public static final int AUTOMATON
      Syntax flag, enables named automata (<identifier>).
    • INTERVAL

      public static final int INTERVAL
      Syntax flag, enables numerical intervals (<n-m>).
    • ALL

      public static final int ALL
      Syntax flag, enables all optional regexp syntax.
    • NONE

      public static final int NONE
      Syntax flag, enables no optional regexp syntax.
    • ASCII_CASE_INSENSITIVE

      @Deprecated public static final int ASCII_CASE_INSENSITIVE
      Deprecated.
      Allows case-insensitive matching of ASCII characters.

      This flag has been deprecated in favor of CASE_INSENSITIVE, which supports the full range of Unicode characters. Using this flag now has the same behavior as CASE_INSENSITIVE.

    • CASE_INSENSITIVE

      public static final int CASE_INSENSITIVE
      Allows case-insensitive matching of most Unicode characters.

      In general the attempt is to reach parity with the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags of Pattern when doing a case-insensitive match. We support common case folding in addition to simple case folding, as defined by the common (C) and simple (S) mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. This is in line with Pattern and means that the characters representing the Greek letter sigma (Σ, σ, ς) all match one another, even though σ and ς are both lowercase characters, as detailed here: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt.
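
The Pattern behavior used as the parity target can be observed with java.util.regex alone. The following is a small sketch, not Lucene code; `SigmaPatternSketch` and `matchesIgnoringCase` are hypothetical names introduced for illustration.

```java
import java.util.regex.Pattern;

class SigmaPatternSketch {
    // The two Pattern flags named above as the parity target.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    /** Returns true if the whole input matches the regex under FLAGS. */
    static boolean matchesIgnoringCase(String regex, String input) {
        return Pattern.compile(regex, FLAGS).matcher(input).matches();
    }

    public static void main(String[] args) {
        String capitalSigma = "\u03A3"; // Σ
        String smallSigma = "\u03C3";   // σ
        String finalSigma = "\u03C2";   // ς
        // Both lowercase forms match each other as well as the capital form.
        System.out.println(matchesIgnoringCase(smallSigma, finalSigma));
        System.out.println(matchesIgnoringCase(smallSigma, capitalSigma));
        System.out.println(matchesIgnoringCase(finalSigma, capitalSigma));
    }
}
```

Without Pattern.UNICODE_CASE, Pattern restricts case-insensitive matching to ASCII, so the same comparisons would fail.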

      The casing of some Unicode characters is difficult to decode correctly. In some cases Java's String class handles these correctly but Java's Pattern class does not. We make only a best effort to maintain consistency with Pattern, and there may be differences.

      There are three known special classes of these characters:

      1. the set of characters whose casing matches across multiple characters, such as the Greek sigma character mentioned above (Σ, σ, ς); we support these; notably, some of these characters fall into the ASCII range and so will behave differently when this flag is enabled
      2. the set of characters that are in neither an upper nor a lower case stable state and can be both uppercased and lowercased from their current code point, such as ǅ, which when uppercased produces Ǆ and when lowercased produces ǆ; we support these
      3. the set of characters that produce more than one character when uppercased; for performance reasons we ignore these characters for now, which is consistent with Pattern

      Sometimes these classes of character overlap; if a character is in both class 3 and any other class listed above, it is ignored; this is consistent with Pattern and the C, S, T mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. Support for class 3 is only available with full (F) mappings, which are not supported. For instance, the character ῼ will match its lowercase form ῳ but not its uppercase form ΩΙ.

      Class 3 characters produce multiple characters when uppercased, such as ﬗ (0xFB17), which uppercases to ՄԽ (code points: 0x0544 0x053D), and are therefore ignored; however, lowercase matching on these values is supported: 0x00DF, 0x0130, 0x0149, 0x01F0, 0x0390, 0x03B0, 0x0587, 0x1E96-0x1E9A, 0x1F50, 0x1F52, 0x1F54, 0x1F56, 0x1F80-0x1FAF, 0x1FB2-0x1FB4, 0x1FB6, 0x1FB7, 0x1FBC, 0x1FC2-0x1FC4, 0x1FC6, 0x1FC7, 0x1FCC, 0x1FD2, 0x1FD3, 0x1FD6, 0x1FD7, 0x1FE2-0x1FE4, 0x1FE6, 0x1FE7, 0x1FF2-0x1FF4, 0x1FF6, 0x1FF7, 0x1FFC, 0xFB00-0xFB06, 0xFB13-0xFB17
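
The three classes described above can be probed with the JDK's own case-mapping methods. The sketch below only classifies code points; it is not how RegExp decides what to match, and `CaseClassesSketch` and its method names are hypothetical.

```java
import java.util.Locale;

class CaseClassesSketch {
    /** Class 1: distinct code points that share one simple uppercase form. */
    static boolean sharesUppercase(int a, int b) {
        return a != b && Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    /** Class 2: the code point changes under both uppercasing and lowercasing. */
    static boolean isCaseUnstable(int codePoint) {
        return Character.toUpperCase(codePoint) != codePoint
            && Character.toLowerCase(codePoint) != codePoint;
    }

    /** Class 3: full (String-level) uppercasing yields more than one code point. */
    static boolean expandsWhenUppercased(int codePoint) {
        String upper = new String(Character.toChars(codePoint)).toUpperCase(Locale.ROOT);
        return upper.codePointCount(0, upper.length()) > 1;
    }

    public static void main(String[] args) {
        System.out.println(sharesUppercase(0x03C3, 0x03C2)); // small sigma, final sigma
        System.out.println(isCaseUnstable(0x01C5));          // titlecase digraph
        System.out.println(expandsWhenUppercased(0x00DF));   // sharp s
        System.out.println(expandsWhenUppercased(0xFB17));   // Armenian ligature men xeh
    }
}
```

Note that class 3 is only visible through String.toUpperCase; Character.toUpperCase maps one code point to one code point, which mirrors the simple versus full mapping distinction drawn above.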

    • CASE_INSENSITIVE_RANGE

      public static final int CASE_INSENSITIVE_RANGE
      Similar to CASE_INSENSITIVE but for character class ranges.

      This flag allows ranges such as [a-z] to match A, but may result in performance costs during parsing.
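
To give an intuition for where a parsing cost could come from, the sketch below expands a range by adding the simple upper and lower case variant of every code point in it, so the work grows with the size of the range. This is an assumption made for illustration, not a description of Lucene's implementation; `CaseInsensitiveRangeSketch`, `expand` and `matches` are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

class CaseInsensitiveRangeSketch {
    /** Collects every code point in [from, to] plus its simple case variants. */
    static Set<Integer> expand(int from, int to) {
        Set<Integer> result = new TreeSet<>();
        for (int codePoint = from; codePoint <= to; codePoint++) {
            result.add(codePoint);
            result.add(Character.toUpperCase(codePoint));
            result.add(Character.toLowerCase(codePoint));
        }
        return result;
    }

    /** Returns true if codePoint is accepted by the case-expanded range. */
    static boolean matches(int from, int to, int codePoint) {
        return expand(from, to).contains(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(matches('a', 'z', 'A')); // true
        System.out.println(matches('a', 'z', '0')); // false
    }
}
```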

    • DEPRECATED_COMPLEMENT

      @Deprecated public static final int DEPRECATED_COMPLEMENT
      Deprecated.
      This flag will be removed in Lucene 11.
      Allows regexp parsing of the complement (~).

      Note that processing the complement can require exponential time, but will be bounded by an internal limit. Regexes exceeding the limit will fail with TooComplexToDeterminizeException.

    • kind

      public final RegExp.Kind kind
      The type of expression
    • exp1

      public final RegExp exp1
      Child expressions held by a container type expression
    • exp2

      public final RegExp exp2
      Child expressions held by a container type expression
    • s

      public final String s
      String expression
    • c

      public final int c
      Character expression
    • min

      public final int min
      Limits for repeatable type expressions
    • max

      public final int max
      Limits for repeatable type expressions
    • digits

      public final int digits
      Limits for repeatable type expressions
    • from

      public final int[] from
      Extents for range type expressions
    • to

      public final int[] to
      Extents for range type expressions
  • Constructor Details

    • RegExp

      public RegExp(String s) throws IllegalArgumentException
      Constructs new RegExp from a string. Same as RegExp(s, ALL).
      Parameters:
      s - regexp string
      Throws:
      IllegalArgumentException - if an error occurred while parsing the regular expression
    • RegExp

      public RegExp(String s, int syntax_flags) throws IllegalArgumentException
      Constructs new RegExp from a string.
      Parameters:
      s - regexp string
      syntax_flags - bitwise OR of the optional syntax constructs to be enabled
      Throws:
      IllegalArgumentException - if an error occurred while parsing the regular expression
    • RegExp

      public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException
      Constructs new RegExp from a string.
      Parameters:
      s - regexp string
      syntax_flags - bitwise OR of the optional syntax constructs to be enabled
      match_flags - bitwise OR of match behavior options such as case insensitivity
      Throws:
      IllegalArgumentException - if an error occurred while parsing the regular expression
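
How flag values combine can be sketched without Lucene on the classpath. The bit values below are hypothetical and chosen only for illustration; the real constants are defined by RegExp and may differ. `SyntaxFlagsSketch` and `isEnabled` are likewise illustrative names.

```java
class SyntaxFlagsSketch {
    // Hypothetical bit values for illustration only; not RegExp's actual constants.
    static final int NONE = 0;
    static final int INTERSECTION = 1 << 0;
    static final int EMPTY = 1 << 1;
    static final int INTERVAL = 1 << 2;

    /** Returns true if the given flag bit is set in the combined flags. */
    static boolean isEnabled(int syntaxFlags, int flag) {
        return (syntaxFlags & flag) != 0;
    }

    public static void main(String[] args) {
        // Enable two optional constructs by OR-ing their flags together.
        int syntaxFlags = INTERSECTION | INTERVAL;
        System.out.println(isEnabled(syntaxFlags, INTERSECTION)); // true
        System.out.println(isEnabled(syntaxFlags, EMPTY));        // false
        System.out.println(isEnabled(NONE, INTERVAL));            // false
    }
}
```

With the real class, the analogous call would pass something like RegExp.INTERSECTION | RegExp.INTERVAL as syntax_flags to the two-argument constructor.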
  • Method Details