org.apache.lucene.util.automaton.RegExp

public class RegExp extends Object

Regular Expression extension to Automaton.

Regular expressions are built from the following abstract syntax:

description of regular expression grammar
regexp	::=	unionexp
	\|
unionexp	::=	interexp `\|` unionexp	(union)
	\|	interexp
interexp	::=	concatexp `&` interexp	(intersection)	[OPTIONAL]
	\|	concatexp
concatexp	::=	repeatexp concatexp	(concatenation)
	\|	repeatexp
repeatexp	::=	repeatexp `?`	(zero or one occurrence)
	\|	repeatexp `*`	(zero or more occurrences)
	\|	repeatexp `+`	(one or more occurrences)
	\|	repeatexp `{n}`	(`n` occurrences)
	\|	repeatexp `{n,}`	(`n` or more occurrences)
	\|	repeatexp `{n,m}`	(`n` to `m` occurrences, including both)
	\|	complexp
charclassexp	::=	`[` charclasses `]`	(character class)
	\|	`[^` charclasses `]`	(negated character class)
	\|	simpleexp
charclasses	::=	charclass charclasses
	\|	charclass
charclass	::=	charexp `-` charexp	(character range, including end-points)
	\|	charexp
simpleexp	::=	charexp
	\|	`.`	(any single character)
	\|	`#`	(the empty language)	[OPTIONAL]
	\|	`@`	(any string)	[OPTIONAL]
	\|	`"` <Unicode string without double-quotes> `"`	(a string)
	\|	`(` `)`	(the empty string)
	\|	`(` unionexp `)`	(precedence override)
	\|	`<` <identifier> `>`	(named automaton)	[OPTIONAL]
	\|	`<n-m>`	(numerical interval)	[OPTIONAL]
charexp	::=	<Unicode character>	(a single non-reserved character)
	\|	`\d`	(a digit [0-9])
	\|	`\D`	(a non-digit [^0-9])
	\|	`\s`	(whitespace [ \t\n\r])
	\|	`\S`	(non whitespace [^\s])
	\|	`\w`	(a word character [a-zA-Z_0-9])
	\|	`\W`	(a non word character [^\w])
	\|	`\` <Unicode character>	(a single character)

The productions marked [OPTIONAL] are only allowed if enabled by the syntax flags passed to the RegExp constructor. Reserved characters used in the (enabled) syntax must be escaped with a backslash (\) or enclosed in double quotes ("..."). (In contrast to other regexp syntaxes, this is also required inside character classes.) Be aware that the dash (-) has a special meaning in charclass expressions. An identifier is a string that contains neither a right angle bracket (>) nor a dash (-). Numerical intervals are specified by non-negative decimal integers and include both end points; if n and m have the same number of digits, the conforming strings must have that length (i.e. be left-padded with 0's).
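The escaping rule above can be sketched as a small helper that turns arbitrary text into a pattern fragment matching it literally. This is a minimal sketch derived only from the grammar on this page: the class and method names (`RegExpLiterals`, `literal`) are hypothetical and not part of Lucene. It prefers the double-quoted string form and falls back to backslash escapes when the text itself contains a double quote. Letters and digits are left unescaped because sequences such as `\d` and `\w` denote character classes.

```java
public final class RegExpLiterals {
    private RegExpLiterals() {}

    /**
     * Builds a pattern fragment that should match {@code text} literally,
     * following the grammar described above (hypothetical helper).
     */
    public static String literal(String text) {
        if (text.indexOf('"') < 0) {
            // String form: any text without double quotes, wrapped in double quotes.
            return "\"" + text + "\"";
        }
        StringBuilder sb = new StringBuilder(text.length() * 2);
        text.codePoints().forEach(cp -> {
            // Escape everything that is not a letter or digit; escaping letters
            // could collide with the \d, \s, \w families of character classes.
            if (!Character.isLetterOrDigit(cp)) {
                sb.append('\\');
            }
            sb.appendCodePoint(cp);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b"));
        System.out.println(literal("say \"hi\""));
    }
}
```

The quoted form is usually easier to read; the backslash form is only needed when the text contains a double quote.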

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary

Nested Classes

Modifier and Type	Class	Description
static enum	RegExp.Kind	The type of expression represented by a RegExp node.

Field Summary

Fields

Modifier and Type	Field	Description
static final int	ALL	Syntax flag, enables all optional regexp syntax.
static final int	ANYSTRING	Syntax flag, enables anystring (@).
static final int	ASCII_CASE_INSENSITIVE	Deprecated.
static final int	AUTOMATON	Syntax flag, enables named automata (<identifier>).
final int	c	Character expression
static final int	CASE_INSENSITIVE	Allows case-insensitive matching of most Unicode characters.
static final int	CASE_INSENSITIVE_RANGE	Similar to CASE_INSENSITIVE but for character class ranges.
static final int	DEPRECATED_COMPLEMENT	Deprecated. This flag will be removed in Lucene 11
final int	digits	Limits for repeatable type expressions
static final int	EMPTY	Syntax flag, enables empty language (#).
final RegExp	exp1	Child expressions held by a container type expression
final RegExp	exp2	Child expressions held by a container type expression
final int[]	from	Extents for range type expressions
static final int	INTERSECTION	Syntax flag, enables intersection (&).
static final int	INTERVAL	Syntax flag, enables numerical intervals (<n-m>).
final RegExp.Kind	kind	The type of expression
final int	max	Limits for repeatable type expressions
final int	min	Limits for repeatable type expressions
static final int	NONE	Syntax flag, enables no optional regexp syntax.
final String	s	String expression
final int[]	to	Extents for range type expressions

Constructor Summary

Constructors

Constructor	Description
RegExp(String s)	Constructs new RegExp from a string.
RegExp(String s, int syntax_flags)	Constructs new RegExp from a string.
RegExp(String s, int syntax_flags, int match_flags)	Constructs new RegExp from a string.

Method Summary

Modifier and Type	Method	Description
Set<String>	getIdentifiers()	Returns the set of automaton identifiers that occur in this regular expression.
String	getOriginalString()	The string that was used to construct the regex.
Automaton	toAutomaton()	Constructs new Automaton from this RegExp.
Automaton	toAutomaton(Map<String,Automaton> automata)	Constructs new Automaton from this RegExp.
Automaton	toAutomaton(AutomatonProvider automaton_provider)	Constructs new Automaton from this RegExp.
String	toString()	Constructs string from parsed regular expression.
String	toStringTree()	Like toString(), but more verbose (shows the hierarchy more clearly).

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- INTERSECTION
  
  public static final int INTERSECTION
  
  Syntax flag, enables intersection (&).
  See Also:
  
  Constant Field Values
- EMPTY
  
  public static final int EMPTY
  
  Syntax flag, enables empty language (#).
  See Also:
  
  Constant Field Values
- ANYSTRING
  
  public static final int ANYSTRING
  
  Syntax flag, enables anystring (@).
  See Also:
  
  Constant Field Values
- AUTOMATON
  
  public static final int AUTOMATON
  
  Syntax flag, enables named automata (<identifier>).
  See Also:
  
  Constant Field Values
- INTERVAL
  
  public static final int INTERVAL
  
  Syntax flag, enables numerical intervals (<n-m>).
  See Also:
  
  Constant Field Values
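The length rule for numerical intervals stated in the class description can be modeled in plain Java. This is a sketch of the documented rule only, not Lucene's implementation: the class and method names (`IntervalModel`, `conforms`) are hypothetical, the bounds are assumed to satisfy lo <= hi, and the behavior when the bounds have different digit counts (any number of leading zeros is accepted) is an assumption that this page does not state.

```java
import java.math.BigInteger;

public final class IntervalModel {
    private IntervalModel() {}

    /**
     * Returns true if {@code candidate} conforms to the interval {@code <lo-hi>}
     * under the rule described on this page (hypothetical model).
     */
    public static boolean conforms(String lo, String hi, String candidate) {
        // Only non-empty strings of decimal digits can conform.
        if (candidate.isEmpty() || !candidate.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return false;
        }
        // Same digit count on both bounds: the candidate must have that exact length.
        if (lo.length() == hi.length() && candidate.length() != lo.length()) {
            return false;
        }
        // Assumption: otherwise leading zeros are ignored and only the value counts.
        BigInteger value = new BigInteger(candidate);
        return value.compareTo(new BigInteger(lo)) >= 0
            && value.compareTo(new BigInteger(hi)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(conforms("01", "31", "07"));
        System.out.println(conforms("01", "31", "7"));
        System.out.println(conforms("1", "31", "7"));
    }
}
```

Under this model, `<01-31>` accepts `07` but rejects `7`, while `<1-31>` accepts `7`.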
- ALL
  
  public static final int ALL
  
  Syntax flag, enables all optional regexp syntax.
  See Also:
  
  Constant Field Values
- NONE
  
  public static final int NONE
  
  Syntax flag, enables no optional regexp syntax.
  See Also:
  
  Constant Field Values
- ASCII_CASE_INSENSITIVE
  
  @Deprecated public static final int ASCII_CASE_INSENSITIVE
  
  Deprecated.
  
  Allows case-insensitive matching of ASCII characters.
  This flag has been deprecated in favor of CASE_INSENSITIVE, which supports the full range of Unicode characters. Using this flag now has the same behavior as CASE_INSENSITIVE.
  See Also:
  
  Constant Field Values
- CASE_INSENSITIVE
  
  public static final int CASE_INSENSITIVE
  
  Allows case-insensitive matching of most Unicode characters.
  In general, the aim is to reach parity with the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags of Java's Pattern class when doing a case-insensitive match. We support common case folding in addition to simple case folding, as defined by the common (C) and simple (S) mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. This is in line with Pattern and means that the forms of the Greek letter sigma (Σ, σ, ς) all match one another, even though σ and ς are both lowercase characters, as detailed here: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt.
  For some Unicode characters it is difficult to determine casing correctly. In some cases Java's String class handles these characters correctly but Java's Pattern class does not. We make only a best effort to maintain consistency with Pattern, and there may be differences.
  There are three known special classes of these characters:
  
  1. the set of characters whose casing matches across multiple characters such as the Greek sigma character mentioned above (Σ, σ, ς); we support these; notably some of these characters fall into the ASCII range and so will behave differently when this flag is enabled
  2. the set of characters that are neither in an upper nor lower case stable state and can be both uppercased and lowercased from their current code point such as ǅ which when uppercased produces Ǆ and when lowercased produces ǆ; we support these
  3. the set of characters that when uppercased produce more than 1 character. For performance reasons we ignore these characters for now, which is consistent with Pattern
  
  Sometimes these classes of character overlap; if a character is in both class 3 and any other class listed above, it is ignored; this is consistent with Pattern and the C, S, T mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. Support for class 3 is only available with the full (F) mappings, which are not supported. For instance, the character ῼ will match its lowercase form ῳ but not its uppercase form ΩΙ.
  Class 3 characters generate multiple characters when uppercased, such as ﬗ (0xFB17), which when uppercased produces ՄԽ (code points: 0x0544 0x053D), and are therefore ignored; however, lowercase matching on these values is supported: 0x00DF, 0x0130, 0x0149, 0x01F0, 0x0390, 0x03B0, 0x0587, 0x1E96-0x1E9A, 0x1F50, 0x1F52, 0x1F54, 0x1F56, 0x1F80-0x1FAF, 0x1FB2-0x1FB4, 0x1FB6, 0x1FB7, 0x1FBC, 0x1FC2-0x1FC4, 0x1FC6, 0x1FC7, 0x1FCC, 0x1FD2, 0x1FD3, 0x1FD6, 0x1FD7, 0x1FE2-0x1FE4, 0x1FE6, 0x1FE7, 0x1FF2-0x1FF4, 0x1FF6, 0x1FF7, 0x1FFC, 0xFB00-0xFB06, 0xFB13-0xFB17
  See Also:
  
  Constant Field Values
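Since CASE_INSENSITIVE is described as aiming for parity with Java's Pattern, the three character classes listed for it can be explored with the JDK alone. The sketch below uses only java.util.regex.Pattern, Character and String, and does not call Lucene, so it shows the JDK behavior that RegExp tries to follow rather than RegExp itself. The class and method names (`CasingClasses`, `matchesIgnoreCase`) are hypothetical, and characters are written as Unicode escapes so the source stays ASCII.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public final class CasingClasses {
    private CasingClasses() {}

    /** Full match of {@code input} against {@code pattern}, ignoring case with Unicode awareness. */
    public static boolean matchesIgnoreCase(String pattern, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(pattern, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Class 1: small sigma (U+03C3) and final sigma (U+03C2) share one uppercase form.
        System.out.println(matchesIgnoreCase("\u03C3", "\u03C2"));

        // Class 2: U+01C5 can be both uppercased (U+01C4) and lowercased (U+01C6).
        System.out.println(Integer.toHexString(Character.toUpperCase(0x01C5)));
        System.out.println(Integer.toHexString(Character.toLowerCase(0x01C5)));

        // Class 3: sharp s (U+00DF) uppercases to two characters in String,
        // but a case-insensitive Pattern does not treat the two forms as equal.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
        System.out.println(matchesIgnoreCase("\u00DF", "SS"));
    }
}
```

The last two lines illustrate the gap mentioned above between String, which applies multi-character case mappings, and Pattern, which does not.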
- CASE_INSENSITIVE_RANGE
  
  public static final int CASE_INSENSITIVE_RANGE
  
  Similar to CASE_INSENSITIVE but for character class ranges.
  This flag allows ranges such as [a-z] to match A, but may result in performance costs during parsing.
  See Also:
  
  Constant Field Values
- DEPRECATED_COMPLEMENT
  
  @Deprecated public static final int DEPRECATED_COMPLEMENT
  
  Deprecated.
  This flag will be removed in Lucene 11
  
  Allows regexp parsing of the complement (~).
  Note that processing the complement can require exponential time, but will be bounded by an internal limit. Regexes exceeding the limit will fail with TooComplexToDeterminizeException.
  See Also:
  
  Constant Field Values
- kind
  
  public final RegExp.Kind kind
  
  The type of expression
- exp1
  
  public final RegExp exp1
  
  Child expressions held by a container type expression
- exp2
  
  public final RegExp exp2
  
  Child expressions held by a container type expression
- s
  
  public final String s
  
  String expression
- c
  
  public final int c
  
  Character expression
- min
  
  public final int min
  
  Limits for repeatable type expressions
- max
  
  public final int max
  
  Limits for repeatable type expressions
- digits
  
  public final int digits
  
  Limits for repeatable type expressions
- from
  
  public final int[] from
  
  Extents for range type expressions
- to
  
  public final int[] to
  
  Extents for range type expressions
Constructor Details
- RegExp
  
  public RegExp(String s) throws IllegalArgumentException
  
  Constructs new RegExp from a string. Same as RegExp(s, ALL).
  
  Parameters:
  
  s - regexp string
  
  Throws:
  
  IllegalArgumentException - if an error occurred while parsing the regular expression
- RegExp
  
  public RegExp(String s, int syntax_flags) throws IllegalArgumentException
  
  Constructs new RegExp from a string.
  
  Parameters:
  
  s - regexp string
  
  syntax_flags - bitwise OR of the syntax flags for the optional constructs to be enabled
  
  Throws:
  
  IllegalArgumentException - if an error occurred while parsing the regular expression
- RegExp
  
  public RegExp(String s, int syntax_flags, int match_flags) throws IllegalArgumentException
  
  Constructs new RegExp from a string.
  
  Parameters:
  
  s - regexp string
  
  syntax_flags - bitwise OR of the syntax flags for the optional constructs to be enabled
  
  match_flags - bitwise OR of match behavior options such as case insensitivity
  
  Throws:
  
  IllegalArgumentException - if an error occurred while parsing the regular expression
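The syntax_flags and match_flags parameters are combinations of int flags joined with the `|` operator. The sketch below shows the mechanics with placeholder constants so that it runs without Lucene. The `DEMO_*` values and the `SyntaxFlagsSketch` class are hypothetical and do not reflect the real constant values, which are listed under Constant Field Values.

```java
public final class SyntaxFlagsSketch {
    private SyntaxFlagsSketch() {}

    // Placeholder bit values for illustration only; they are NOT the values
    // of the corresponding RegExp constants.
    public static final int DEMO_INTERSECTION = 1 << 0;
    public static final int DEMO_INTERVAL = 1 << 1;
    public static final int DEMO_ANYSTRING = 1 << 2;

    /** Returns true if every bit of {@code flag} is set in {@code flags}. */
    public static boolean enables(int flags, int flag) {
        return (flags & flag) == flag;
    }

    public static void main(String[] args) {
        // Enable two optional constructs and leave the third disabled.
        int flags = DEMO_INTERSECTION | DEMO_INTERVAL;
        System.out.println(enables(flags, DEMO_INTERVAL));
        System.out.println(enables(flags, DEMO_ANYSTRING));
        // With Lucene on the classpath, the equivalent call would use the real constants:
        // new RegExp(pattern, RegExp.INTERSECTION | RegExp.INTERVAL)
    }
}
```

Passing ALL or NONE instead of a combination enables every optional construct or none of them, as described in the field summary.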
Method Details
- toAutomaton
  
  public Automaton toAutomaton()
  
  Constructs new Automaton from this RegExp. Same as toAutomaton(null) (empty automaton map).
- toAutomaton
  
  public Automaton toAutomaton(AutomatonProvider automaton_provider) throws IllegalArgumentException, TooComplexToDeterminizeException
  
  Constructs new Automaton from this RegExp.
  
  Parameters:
  
  automaton_provider - provider of automata for named identifiers
  
  Throws:
  
  IllegalArgumentException - if this regular expression uses a named identifier that is not available from the automaton provider
  
  TooComplexToDeterminizeException
- toAutomaton
  
  public Automaton toAutomaton(Map<String,Automaton> automata) throws IllegalArgumentException, TooComplexToDeterminizeException
  
  Constructs new Automaton from this RegExp.
  
  Parameters:
  
  automata - a map from automaton identifiers to automata (of type Automaton).
  
  Throws:
  
  IllegalArgumentException - if this regular expression uses a named identifier that does not occur in the automaton map
  
  TooComplexToDeterminizeException
- getOriginalString
  
  public String getOriginalString()
  
  The string that was used to construct the regex. Compare to toString.
- toString
  
  public String toString()
  
  Constructs string from parsed regular expression.
  
  Overrides:
  
  toString in class Object
- toStringTree
  
  public String toStringTree()
  
  Like toString(), but more verbose (shows the hierarchy more clearly).
- getIdentifiers
  
  public Set<String> getIdentifiers()
  
  Returns set of automaton identifiers that occur in this regular expression.
