Class RegExp
Regular expression extension to Automaton.
Regular expressions are built from the following abstract syntax:
regexp        ::=  unionexp
                |
unionexp      ::=  interexp | unionexp                          (union)
                |  interexp
interexp      ::=  concatexp & interexp                         (intersection) [OPTIONAL]
                |  concatexp
concatexp     ::=  repeatexp concatexp                          (concatenation)
                |  repeatexp
repeatexp     ::=  repeatexp ?                                  (zero or one occurrence)
                |  repeatexp *                                  (zero or more occurrences)
                |  repeatexp +                                  (one or more occurrences)
                |  repeatexp {n}                                (n occurrences)
                |  repeatexp {n,}                               (n or more occurrences)
                |  repeatexp {n,m}                              (n to m occurrences, including both)
                |  complexp
charclassexp  ::=  [ charclasses ]                              (character class)
                |  [^ charclasses ]                             (negated character class)
                |  simpleexp
charclasses   ::=  charclass charclasses
                |  charclass
charclass     ::=  charexp - charexp                            (character range, including end-points)
                |  charexp
simpleexp     ::=  charexp
                |  .                                            (any single character)
                |  #                                            (the empty language) [OPTIONAL]
                |  @                                            (any string) [OPTIONAL]
                |  " <Unicode string without double-quotes> "   (a string)
                |  ( )                                          (the empty string)
                |  ( unionexp )                                 (precedence override)
                |  < <identifier> >                             (named automaton) [OPTIONAL]
                |  <n-m>                                        (numerical interval) [OPTIONAL]
charexp       ::=  <Unicode character>                          (a single non-reserved character)
                |  \d                                           (a digit [0-9])
                |  \D                                           (a non-digit [^0-9])
                |  \s                                           (whitespace [ \t\n\r])
                |  \S                                           (non whitespace [^\s])
                |  \w                                           (a word character [a-zA-Z_0-9])
                |  \W                                           (a non word character [^\w])
                |  \ <Unicode character>                        (a single character)
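
The three bounded repetition forms can be read as shorthand for concatenation combined with ? and *. The following is a minimal sketch of that reading, not code taken from RegExp: the class RepeatSketch and its expand method are hypothetical names introduced for illustration, and passing -1 to mean "no upper bound" is an assumption of this sketch.

```java
final class RepeatSketch {
    /**
     * Rewrites r{min}, r{min,} or r{min,max} using only concatenation, ? and *.
     * Pass max = -1 for the open-ended form r{min,} (a convention of this sketch).
     */
    static String expand(String r, int min, int max) {
        if (min < 0 || (max >= 0 && max < min)) {
            throw new IllegalArgumentException("invalid bounds: " + min + "," + max);
        }
        String unit = "(" + r + ")";              // parentheses override precedence
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < min; i++) {
            sb.append(unit);                      // min mandatory copies
        }
        if (max < 0) {
            sb.append(unit).append('*');          // {min,}: followed by zero or more
        } else {
            for (int i = min; i < max; i++) {
                sb.append(unit).append('?');      // {min,max}: max - min optional copies
            }
        }
        // Zero occurrences only: fall back to the empty-string production.
        return sb.length() == 0 ? "()" : sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("ab", 2, 3));   // (ab)(ab)(ab)?
        System.out.println(expand("a", 1, -1));   // (a)(a)*
    }
}
```

Because both end points are included, expand("ab", 2, 3) accepts exactly two or three copies of ab.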
The productions marked [OPTIONAL] are only allowed if specified by the syntax
flags passed to the RegExp constructor. The reserved characters used in the
(enabled) syntax must be escaped with backslash (\) or double-quotes ("...").
(In contrast to other regexp syntaxes, this is also required in character
classes.) Be aware that dash (-) has a special meaning in charclass
expressions. An identifier is a string not containing right angle bracket (>)
or dash (-). Numerical intervals are specified by non-negative decimal integers
and include both end points. If n and m have the same number of digits, then
the conforming strings must have that length (i.e. shorter values are prefixed
by 0's).
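
The interval rule can be made concrete with a small membership check. This is a sketch under stated assumptions rather than the library's implementation: IntervalSketch and conforms are hypothetical names, and the behaviour when n and m have different digit counts (here, leading zeros are simply ignored) is an assumption, since the text above only specifies the equal-length case.

```java
import java.math.BigInteger;

final class IntervalSketch {
    /** Returns true if s conforms to <n-m>; n and m are given as decimal strings. */
    static boolean conforms(String s, String n, String m) {
        // Only non-empty strings of ASCII decimal digits can qualify.
        if (s.isEmpty() || !s.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return false;
        }
        // Same digit count in both bounds: the candidate must have exactly that length.
        if (n.length() == m.length() && s.length() != n.length()) {
            return false;
        }
        // Assumption: with differing digit counts, leading zeros are ignored.
        BigInteger value = new BigInteger(s);
        BigInteger lo = new BigInteger(n);
        BigInteger hi = new BigInteger(m);
        // Both end points are included.
        return value.compareTo(lo) >= 0 && value.compareTo(hi) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(conforms("007", "001", "100")); // true
        System.out.println(conforms("7", "001", "100"));   // false
    }
}
```

With <001-100>, the value 7 has to be written as 007, while both end points 001 and 100 are accepted.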
WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
| Modifier and Type | Class | Description |
| static enum | RegExp.Kind | The type of expression represented by a RegExp node. |

Field Summary
| Modifier and Type | Field | Description |
| static final int | ALL | Syntax flag, enables all optional regexp syntax. |
| static final int | ANYSTRING | Syntax flag, enables anystring (@). |
| static final int | ASCII_CASE_INSENSITIVE | Allows case insensitive matching of ASCII characters. |
| static final int | AUTOMATON | Syntax flag, enables named automata (<identifier>). |
| final int | c | Character expression |
| static final int | DEPRECATED_COMPLEMENT | Deprecated. This field will be removed in Lucene 11 |
| final int | digits | Limits for repeatable type expressions |
| static final int | EMPTY | Syntax flag, enables empty language (#). |
| final RegExp | exp1 | Child expressions held by a container type expression |
| final RegExp | exp2 | Child expressions held by a container type expression |
| final int | from | Extents for range type expressions |
| static final int | INTERSECTION | Syntax flag, enables intersection (&). |
| static final int | INTERVAL | Syntax flag, enables numerical intervals (<n-m>). |
| final RegExp.Kind | kind | The type of expression |
| final int | max | Limits for repeatable type expressions |
| final int | min | Limits for repeatable type expressions |
| static final int | NONE | Syntax flag, enables no optional regexp syntax. |
| final String | s | String expression |
| final int | to | Extents for range type expressions |

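Constructor Summary
| Constructor | Description |
| RegExp(String s) | Constructs new RegExp from a string. |
| RegExp(String s, int syntax_flags) | Constructs new RegExp from a string. |
| RegExp(String s, int syntax_flags, int match_flags) | Constructs new RegExp from a string. |
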
Method Summary
| Modifier and Type | Method | Description |
| Set<String> | getIdentifiers() | Returns set of automaton identifiers that occur in this regular expression. |
| String | getOriginalString() | The string that was used to construct the regex. |
| Automaton | toAutomaton() | Constructs new Automaton from this RegExp. |
| Automaton | toAutomaton(Map<String, Automaton> automata) | Constructs new Automaton from this RegExp. |
| Automaton | toAutomaton(AutomatonProvider automaton_provider) | Constructs new Automaton from this RegExp. |
| String | toString() | Constructs string from parsed regular expression. |
| String | toStringTree() | Like toString, but more verbose (shows the hierarchy more clearly). |

Field Details

INTERSECTION
public static final int INTERSECTION
Syntax flag, enables intersection (&).

EMPTY
public static final int EMPTY
Syntax flag, enables empty language (#).

ANYSTRING
public static final int ANYSTRING
Syntax flag, enables anystring (@).

AUTOMATON
public static final int AUTOMATON
Syntax flag, enables named automata (<identifier>).

INTERVAL
public static final int INTERVAL
Syntax flag, enables numerical intervals (<n-m>).

ALL
public static final int ALL
Syntax flag, enables all optional regexp syntax.

NONE
public static final int NONE
Syntax flag, enables no optional regexp syntax.

ASCII_CASE_INSENSITIVE
public static final int ASCII_CASE_INSENSITIVE
Allows case insensitive matching of ASCII characters.

DEPRECATED_COMPLEMENT
@Deprecated public static final int DEPRECATED_COMPLEMENT
Deprecated. This field will be removed in Lucene 11.
Allows regexp parsing of the complement (~).
Note that processing the complement can require exponential time, but will be bounded by an internal limit. Regexes exceeding the limit will fail with TooComplexToDeterminizeException.

kind
public final RegExp.Kind kind
The type of expression

exp1
public final RegExp exp1
Child expressions held by a container type expression

exp2
public final RegExp exp2
Child expressions held by a container type expression

s
public final String s
String expression

c
public final int c
Character expression

min
public final int min
Limits for repeatable type expressions

max
public final int max
Limits for repeatable type expressions

digits
public final int digits
Limits for repeatable type expressions

from
public final int from
Extents for range type expressions

to
public final int to
Extents for range type expressions

Constructor Details

RegExp
public RegExp(String s)
Constructs new RegExp from a string. Same as RegExp(s, ALL).
Parameters:
s - regexp string
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

RegExp
public RegExp(String s, int syntax_flags)
Constructs new RegExp from a string.
Parameters:
s - regexp string
syntax_flags - boolean 'or' of optional syntax constructs to be enabled
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

RegExp
public RegExp(String s, int syntax_flags, int match_flags)
Constructs new RegExp from a string.
Parameters:
s - regexp string
syntax_flags - boolean 'or' of optional syntax constructs to be enabled
match_flags - boolean 'or' of match behavior options such as case insensitivity
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

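
The syntax_flags parameter is a bit mask: individual flags are combined with bitwise OR and tested with bitwise AND. The sketch below illustrates only that mechanism. SyntaxFlagsSketch is a hypothetical class, and the bit values it assigns are stand-ins chosen for the example, not the values of the real RegExp constants.

```java
final class SyntaxFlagsSketch {
    // Hypothetical stand-in bit values; the real constants are defined by RegExp.
    static final int INTERSECTION = 1 << 0;
    static final int EMPTY        = 1 << 1;
    static final int ANYSTRING    = 1 << 2;
    static final int AUTOMATON    = 1 << 3;
    static final int INTERVAL     = 1 << 4;
    static final int NONE         = 0;
    static final int ALL = INTERSECTION | EMPTY | ANYSTRING | AUTOMATON | INTERVAL;

    /** Returns true if the given optional construct is enabled in the mask. */
    static boolean enabled(int syntaxFlags, int flag) {
        return (syntaxFlags & flag) != 0;
    }

    public static void main(String[] args) {
        // Enable intersection (&) and numerical intervals (<n-m>) only.
        int flags = INTERSECTION | INTERVAL;
        System.out.println(enabled(flags, INTERVAL));  // true
        System.out.println(enabled(flags, ANYSTRING)); // false
    }
}
```

With the real class, the same combination would be passed as the second constructor argument, for example RegExp.INTERSECTION | RegExp.INTERVAL.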
Method Details

toAutomaton
Constructs new Automaton from this RegExp. Same as toAutomaton(null) (empty automaton map).

toAutomaton
public Automaton toAutomaton(AutomatonProvider automaton_provider) throws IllegalArgumentException, TooComplexToDeterminizeException
Constructs new Automaton from this RegExp.
Parameters:
automaton_provider - provider of automata for named identifiers
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that is not available from the automaton provider
TooComplexToDeterminizeException

toAutomaton
public Automaton toAutomaton(Map<String, Automaton> automata) throws IllegalArgumentException, TooComplexToDeterminizeException
Constructs new Automaton from this RegExp.
Parameters:
automata - a map from automaton identifiers to automata (of type Automaton).
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that does not occur in the automaton map
TooComplexToDeterminizeException

getOriginalString
The string that was used to construct the regex. Compare to toString.

toString
Constructs string from parsed regular expression.

toStringTree
Like toString, but more verbose (shows the hierarchy more clearly).

getIdentifiers
Returns set of automaton identifiers that occur in this regular expression.
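
To illustrate how named identifiers relate to the automaton map, the following sketch scans a regexp string for <identifier> occurrences and checks them against a map. It is an approximation written for this page, not the parser used by RegExp: IdentifierSketch, identifiers and requireAll are hypothetical names, the scanner handles only backslash escapes and double-quoted strings, and it treats each escaped character as a single Java char.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class IdentifierSketch {
    /** Collects the names used as <identifier>, skipping escapes, quoted strings and <n-m>. */
    static Set<String> identifiers(String regexp) {
        Set<String> ids = new LinkedHashSet<>();
        int i = 0;
        while (i < regexp.length()) {
            char ch = regexp.charAt(i);
            if (ch == '\\') {
                i += 2;                                   // escaped character: skip it
            } else if (ch == '"') {
                int close = regexp.indexOf('"', i + 1);   // quoted string: skip to closing quote
                i = (close < 0) ? regexp.length() : close + 1;
            } else if (ch == '<') {
                int close = regexp.indexOf('>', i + 1);
                if (close < 0) {
                    break;                                // unterminated; a real parser would report an error
                }
                String body = regexp.substring(i + 1, close);
                // An identifier contains no dash; a dash marks a numerical interval.
                if (!body.isEmpty() && body.indexOf('-') < 0) {
                    ids.add(body);
                }
                i = close + 1;
            } else {
                i++;
            }
        }
        return ids;
    }

    /** Mirrors the documented failure mode: an unknown identifier is an IllegalArgumentException. */
    static void requireAll(Set<String> ids, Map<String, ?> automata) {
        for (String id : ids) {
            if (automata == null || !automata.containsKey(id)) {
                throw new IllegalArgumentException("'" + id + "' not found in automaton map");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(identifiers("<digits>+<1-10>")); // [digits]
    }
}
```

Under these assumptions, <1-10> is read as an interval and contributes no identifier, while <digits> must be present in the map passed to toAutomaton.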