org.apache.lucene.util.automaton
Class RegExp

java.lang.Object
  extended by org.apache.lucene.util.automaton.RegExp

public class RegExp
extends Object

Regular Expression extension to Automaton.

Regular expressions are built from the following abstract syntax:

regexp       ::= unionexp
             |
unionexp     ::= interexp | unionexp  (union)
             |   interexp
interexp     ::= concatexp & interexp  (intersection) [OPTIONAL]
             |   concatexp
concatexp    ::= repeatexp concatexp  (concatenation)
             |   repeatexp
repeatexp    ::= repeatexp ?  (zero or one occurrence)
             |   repeatexp *  (zero or more occurrences)
             |   repeatexp +  (one or more occurrences)
             |   repeatexp {n}  (n occurrences)
             |   repeatexp {n,}  (n or more occurrences)
             |   repeatexp {n,m}  (n to m occurrences, including both)
             |   complexp
complexp     ::= ~ complexp  (complement) [OPTIONAL]
             |   charclassexp
charclassexp ::= [ charclasses ]  (character class)
             |   [^ charclasses ]  (negated character class)
             |   simpleexp
charclasses  ::= charclass charclasses
             |   charclass
charclass    ::= charexp - charexp  (character range, including end-points)
             |   charexp
simpleexp    ::= charexp
             |   .  (any single character)
             |   #  (the empty language) [OPTIONAL]
             |   @  (any string) [OPTIONAL]
             |   " <Unicode string without double-quotes> "  (a string)
             |   ( )  (the empty string)
             |   ( unionexp )  (precedence override)
             |   < <identifier> >  (named automaton) [OPTIONAL]
             |   <n-m>  (numerical interval) [OPTIONAL]
charexp      ::= <Unicode character>  (a single non-reserved character)
             |   \ <Unicode character>  (a single character)

Productions marked [OPTIONAL] are allowed only if enabled by the syntax flags passed to the RegExp constructor. Reserved characters used in the enabled syntax must be escaped with a backslash (\) or enclosed in double quotes ("..."). In contrast to other regexp syntaxes, this is also required inside character classes. Note that a dash (-) has a special meaning in charclass expressions. An identifier is a string that contains neither a right angle bracket (>) nor a dash (-). Numerical intervals are specified by non-negative decimal integers and include both end points; if n and m have the same number of digits, conforming strings must have exactly that length (i.e. be left-padded with zeros).
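
The escaping rule above can be illustrated with a small helper that backslash-escapes literal text before it is embedded in a regexp string. This is only a sketch: the class name RegExpEscaper and the reserved-character set are assumptions read off the grammar with all optional syntax enabled, not part of this API. It relies on the grammar's rule that a backslash followed by any character denotes that single character.

```java
final class RegExpEscaper {
    // Assumed reserved set, derived from the grammar with all optional
    // syntax enabled; '-' and '^' matter inside character classes.
    private static final String RESERVED = "|&?*+{}~[]^-.#@\"()<>\\";

    /** Returns s with every assumed-reserved character prefixed by a backslash. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (RESERVED.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a.b*c")); // a\.b\*c
        System.out.println(escape("[x-y]")); // \[x\-y\]
    }
}
```

Text without double quotes could instead be wrapped in "..." as the grammar allows; the backslash form is used here because it also covers text that contains double quotes.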

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary
static int ALL
          Syntax flag, enables all optional regexp syntax.
static int ANYSTRING
          Syntax flag, enables anystring (@).
static int AUTOMATON
          Syntax flag, enables named automata (<identifier>).
static int COMPLEMENT
          Syntax flag, enables complement (~).
static int EMPTY
          Syntax flag, enables empty language (#).
static int INTERSECTION
          Syntax flag, enables intersection (&).
static int INTERVAL
          Syntax flag, enables numerical intervals (<n-m>).
static int NONE
          Syntax flag, enables no optional regexp syntax.
 
Constructor Summary
RegExp(String s)
          Constructs new RegExp from a string.
RegExp(String s, int syntax_flags)
          Constructs new RegExp from a string.
 
Method Summary
 Set<String> getIdentifiers()
          Returns set of automaton identifiers that occur in this regular expression.
 boolean setAllowMutate(boolean flag)
          Sets or resets allow mutate flag.
 Automaton toAutomaton()
          Constructs new Automaton from this RegExp.
 Automaton toAutomaton(AutomatonProvider automaton_provider)
          Constructs new Automaton from this RegExp.
 Automaton toAutomaton(Map<String,Automaton> automata)
          Constructs new Automaton from this RegExp.
 String toString()
          Constructs string from parsed regular expression.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

INTERSECTION

public static final int INTERSECTION
Syntax flag, enables intersection (&).

See Also:
Constant Field Values

COMPLEMENT

public static final int COMPLEMENT
Syntax flag, enables complement (~).

See Also:
Constant Field Values

EMPTY

public static final int EMPTY
Syntax flag, enables empty language (#).

See Also:
Constant Field Values

ANYSTRING

public static final int ANYSTRING
Syntax flag, enables anystring (@).

See Also:
Constant Field Values

AUTOMATON

public static final int AUTOMATON
Syntax flag, enables named automata (<identifier>).

See Also:
Constant Field Values

INTERVAL

public static final int INTERVAL
Syntax flag, enables numerical intervals (<n-m>).

See Also:
Constant Field Values
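
The interval rule from the class description (both end points included, fixed width when n and m have the same number of digits) can be sketched as a plain string check. This is an illustration of the documented semantics only, not the automaton construction used by this class; the class name IntervalSketch is hypothetical, and the handling of bounds with different lengths (numeric value only, so leading zeros are tolerated) is an assumption.

```java
import java.math.BigInteger;

final class IntervalSketch {
    /**
     * Returns true if s is a decimal string inside the interval n..m.
     * Assumption: when n and m differ in length, only the numeric value
     * is checked, so leading zeros are tolerated.
     */
    static boolean matches(String n, String m, String s) {
        if (s.isEmpty() || !s.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return false; // not a non-negative decimal integer
        }
        if (n.length() == m.length() && s.length() != n.length()) {
            return false; // fixed width: must be zero-padded to that length
        }
        BigInteger v = new BigInteger(s);
        return v.compareTo(new BigInteger(n)) >= 0
            && v.compareTo(new BigInteger(m)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("01", "20", "07")); // true
        System.out.println(matches("01", "20", "7"));  // false
        System.out.println(matches("01", "20", "21")); // false
    }
}
```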

ALL

public static final int ALL
Syntax flag, enables all optional regexp syntax.

See Also:
Constant Field Values

NONE

public static final int NONE
Syntax flag, enables no optional regexp syntax.

See Also:
Constant Field Values
Constructor Detail

RegExp

public RegExp(String s)
       throws IllegalArgumentException
Constructs new RegExp from a string. Same as RegExp(s, ALL).

Parameters:
s - regexp string
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression

RegExp

public RegExp(String s,
              int syntax_flags)
       throws IllegalArgumentException
Constructs new RegExp from a string.

Parameters:
s - regexp string
syntax_flags - bitwise OR of the syntax flags for the optional constructs to be enabled
Throws:
IllegalArgumentException - if an error occurred while parsing the regular expression
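
As a rough illustration of how syntax flags combine, the following stand-alone sketch ORs two flags together and tests which constructs are enabled. The constants here are hypothetical stand-ins for RegExp.INTERSECTION, RegExp.COMPLEMENT and RegExp.INTERVAL (the real values are listed under Constant Field Values), and isEnabled is not a method of this class.

```java
final class SyntaxFlagsSketch {
    // Hypothetical stand-in values; each flag is assumed to occupy its own bit.
    static final int INTERSECTION = 0x0001;
    static final int COMPLEMENT = 0x0002;
    static final int INTERVAL = 0x0020;

    /** Returns true if the given flag bit is set in syntaxFlags. */
    static boolean isEnabled(int syntaxFlags, int flag) {
        return (syntaxFlags & flag) != 0;
    }

    public static void main(String[] args) {
        int flags = INTERSECTION | COMPLEMENT;
        System.out.println(isEnabled(flags, INTERSECTION)); // true
        System.out.println(isEnabled(flags, INTERVAL));     // false
    }
}
```

With Lucene on the classpath, the corresponding constructor call would be new RegExp("a&~b", RegExp.INTERSECTION | RegExp.COMPLEMENT), which enables only the & and ~ operators.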
Method Detail

toAutomaton

public Automaton toAutomaton()
Constructs new Automaton from this RegExp. Same as toAutomaton(null) (empty automaton map).


toAutomaton

public Automaton toAutomaton(AutomatonProvider automaton_provider)
                      throws IllegalArgumentException
Constructs new Automaton from this RegExp. The constructed automaton is minimal and deterministic and has no transitions to dead states.

Parameters:
automaton_provider - provider of automata for named identifiers
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that is not available from the automaton provider

toAutomaton

public Automaton toAutomaton(Map<String,Automaton> automata)
                      throws IllegalArgumentException
Constructs new Automaton from this RegExp. The constructed automaton is minimal and deterministic and has no transitions to dead states.

Parameters:
automata - a map from automaton identifiers to automata (of type Automaton).
Throws:
IllegalArgumentException - if this regular expression uses a named identifier that does not occur in the automaton map
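
One way to avoid this exception is to compare the result of getIdentifiers() with the keys of the automaton map before calling toAutomaton. The helper below is a minimal sketch of that check; the class and method names are hypothetical, and the map's value type is left as a wildcard so the example compiles without Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class IdentifierCheck {
    /** Returns the identifiers that have no entry in the automaton map, sorted. */
    static Set<String> missing(Set<String> identifiers, Map<String, ?> automata) {
        Set<String> result = new TreeSet<>(identifiers);
        result.removeAll(automata.keySet());
        return result;
    }

    public static void main(String[] args) {
        // Stand-ins for regExp.getIdentifiers() and the map given to toAutomaton.
        Set<String> ids = new HashSet<>(Arrays.asList("digits", "letters"));
        Map<String, Object> automata = new HashMap<>();
        automata.put("digits", new Object());
        System.out.println(missing(ids, automata)); // [letters]
    }
}
```

In real use the call would look like missing(regExp.getIdentifiers(), automata), with toAutomaton(automata) invoked only when the returned set is empty.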

setAllowMutate

public boolean setAllowMutate(boolean flag)
Sets or resets the allow-mutate flag. If this flag is set, automaton construction uses mutable automata, which is slightly faster but not thread-safe. By default, the flag is not set.

Parameters:
flag - if true, the flag is set
Returns:
previous value of the flag

toString

public String toString()
Constructs string from parsed regular expression.

Overrides:
toString in class Object

getIdentifiers

public Set<String> getIdentifiers()
Returns set of automaton identifiers that occur in this regular expression.



Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.