Documentation: Regular Expressions [Remove Frame]
|
What are Regular Expressions?
|
Anatomy of Regular Expressions
|
Atoms are the simplest element in a regular expression. The simplest atoms are
individual letters. For example, the letter "a" is an atom, and it matches a
letter "a" in the subject string.
Another kind of atom is the character class, which is simply a set of characters;
for example, the atom "[abcdefg]" matches one of {a,b,c,d,e,f,g} in the subject
string. Thus, "[abcdefg]" matches "a" or "g" but not "q".
As a shorthand, you
can identify a bunch of characters by a character range, as in "[a-z]" which is
automatically translated into "[abcdefghijklmnopqrstuvwxyz]". You can see this
is a lot shorter.
You can also simply negate the set using "^"; thus "[^a]" matches any character, EXCEPT the "a". To include the "]" character itself it must be used as the first character after the "[". To include the "-" itself it must be the first character after the "[", the second character of the range, or the character just before the "]".
You can embed control characters by the usual C conventions: \t for TAB, \n for
NEWLINE, and so on. Or you can use the control-character notation, like \cH for
^H or BACKSPACE.
Other characters may be entered by octal or hexadecimal
notation: \0 for octal 0, and \xff for 0xff or decimal 255 (all 256 possible
characters are legal in the subject string, FXRex does not treat any character
specially, not even the end-of-string character).
Repetitions may optionally follow an atom, making a piece. For example, "a*" matches "", "a", "aa", and so on. The "*" operator matches zero or more occurrences. There are also several other repetition operators. The "+" operator matches one or more occurrences, and the "?" matches zero or one occurrence.
Finally, there is also
a way to match a bounded number of occurrences.
For example, "a{3}" matches "aaa", while "a{0,3}" matches "", "a", "aa", and
"aaa", but not "aaaa".
The general form of the bounded repetition is "a{n,m}" and this matches at least
n but no more than m occurrences of the preceding atom.
For convenience, we
allow omission of the n or the m; thus, "a{,m}" becomes equivalent to "a{0,m}",
and "a{n,}" becomes equivalent to "a{n,infinite}".
The special form "a{n}"
is equivalent to "a{n,n}".
Repetitions may be greedy, lazy, or possessive. Greedy repetitions are the default; greedy repetitions will try to match as many characters as possible, whereas lazy repetitions will try to match as few as possible. Finally, possessive repetitions behave like greedy ones except that there is no attempt to match fewer occurrences: if the pattern doesn't match after the maximum number of occurrences, no further attempts will be tried with fewer ones.
You can indicate a lazy repetition by appending a "?". So "a*?" is the lazy equivalent of "a*", "a??" the lazy equivalent of "a?", and "a{2,5}?" that of "a{2,5}". A possessive repetition is performed by suffixing the "+" after the end. For example, the pattern "a*+" is the possessive version of "a*".
A sequence of pieces makes up one branch. Several branches may be separated by a "|", and these become alternatives. For example, the pattern "ac|dc" matches "ac", or "dc".
Precedence rules are such that repetitions bind more strongly than sequencing, so that "ab*" matches a single "a" followed by zero or more "b"'s. Conversely, the "|" operator is the weakest, making the pattern "a|b*" match either a single "a", or a sequence of zero or more "b"'s. Explicit use of parentheses allows other precedences: "(a|b)*" is a sequence of zero or more characters, each either "a" or "b", i.e. the same as "[ab]*".
FXRex, like PERL, offers a more convenient way to specify certain frequent
character classes; they are also faster to use when matching.
For example, "\s" matches any whitespace, "\S" matches anything other
than whitespace. The "\d" matches any digit, and is equivalent to "[0-9]".
Of course,
"\D" is equivalent to "[^0-9]". The "\l" matches any letter like "[a-zA-Z]",
and "\L" any non-letter. "\w" matches any word character and is equivalent
to "[a-zA-Z_0-9]". See the table below for a full list.
You can use these character set shortcuts inside a character class; for example,
"[\dA-Fa-f]" is equivalent to "[0-9A-Fa-f]", which could actually also be done
more compactly by "\h" (hexdigits).
Assertions provide a way to match without consuming any characters. You are probably familiar with "^", which matches the begin of the line, and "$", which matches the end of the line. FXRex also provides a few additional assertions: "\<" matches word begin, i.e. a character position such that the previous character matches \W and the following character matches with \w. Likewise "\>" matches a word end. The "\b" and "\B" match word boundary and word interior, respectively. Note that the backspace character may be entered as \cH or \x8.
The ultimate in assertions is the so-called positive or negative look-ahead, and positive or negative look-behind. Lookahead effectively provides for arbitrarily complex positive or negative assertions. Unless you're familiar with PERL, you may not have seen certain features before. First, FXRex supports additional zero-width assertions [Zero width assertions are points in the recognition phase of a pattern where a match can pass or fail without consuming characters from the subject string].
The "(?= ... )" and "(?! ... )" syntax is used for positive and negative look ahead, respectively. For example "fred(?!erick)" will match "fred" and "freddy", but not "frederick".
The look-behind assertions are specified with the syntax "(?<= ...)" for positive look-behind, and "(?<! ...)" for negative look-behind. Unlike look-ahead, the look-behind string can not be totally arbitrary: it needs to be a fixed length only. This limitation may be removed in future versions of FXRex.
Jeffrey Friedl wrote what is now the standard work on regular expressions, see Mastering Regular Expressions. This book is recommended for further background on Regular Expressions.
Regular Expression Grammar
|
expression | ::= | branch { "|" branch }* |
branch | ::= | { piece }* |
piece | ::= | atom [ rep ] |
rep | ::= | ( "*" | "+" | "?" | counts ) [ "?" ] |
counts | ::= | "{" digits ["," [ digits] ] "}" |
atom | ::= | "(" expression ")" | "[" [^] range "]" | characters |
range | ::= | character | character "-" character |
characters | ::= | { character }* |
digits | ::= | { digit }* |
When parsing a pattern FXRex first performs a grammar check, and measures the resulting regex code. Subsequently it generates the pattern code. FXRex returns immediately when the syntax is found to be incorrect, and returns an appropriate error code.
Matching Operators
|
| | Alternation. |
( ... ) | Grouping sub pattern. |
(?i ... ) | Match sub pattern case insensitive. |
(?I ... ) | Match sub pattern case sensitive. |
(?n ... ) | Match sub pattern with newlines. |
(?N ... ) | Match sub pattern with no newlines. |
(?: ... ) | Non-capturing parentheses. |
[ ... ] | Character class. |
[^ ... ] | Negated character class. |
* | Match 0 or more. |
+ | Match 1 or more. |
? | Match 0 or 1. |
{} | Match 0 or more. |
{n} | Match n times. |
{,m} | Match no more than m times. |
{n,} | Match n or more. |
{n,m} | Match at least n but no more than m times. |
*? | Match 0 or more. |
+? | Match 1 or more. |
?? | Match 0 or 1. |
{}? | Match 0 or more times. |
{n}? | Match n times. |
{,m}? | Match no more than m times. |
{n,}? | Match n or more. |
{n,m}? | Match at least n but no more than m times. |
\a | Alarm, bell. |
\e | Escape character. |
\t | Tab. |
\f | Form feed. |
\n | Newline. |
\r | Return. |
\v | Vertical tab. |
\cx | Control character. |
\033 | Octal. |
\x1b | Hex. |
. | Match any character. |
\w | Word character [a-zA-Z_0-9]. |
\W | Non-word character. |
\l | Letter [a-zA-Z]. |
\L | Non-letter. |
\s | Space. |
\S | Non-space. |
\d | Digit [0-9]. |
\D | Non-digit. |
\h | Hex digit [0-9a-fA-F]. |
\H | Non-hex digit. |
\u | Single uppercase character. |
\U | Single lowercase character. |
\p | Punctuation (not including '_'). |
\P | Non punctuation. |
^ | Match begin of line [if at begin of pattern]. |
$ | Match end of line [if at end of pattern]. |
\< | Begin of word. |
\> | End of word. |
\b | Word boundary. |
\B | Word interior. |
\A | Match only beginning of string. |
\Z | Match only and end of string. |
(?= ... ) | Positive lookahead. |
(?! ... ) | Negative lookahead. |
(?<= ... ) | Positive lookbehind. |
(?<! ... ) | Negative lookbehind. |
\1 ... \9 | Back reference. |
The FXRex Class
|
FXRex contains the following member functions:
FXRex()
The default constructor initializes FXRex to the fallback or empty pattern.
FXRex(const FXRex& orig)
The copy constructor initializes FXRex to the a copy of the original pattern.
FXRex(const FXchar* pattern,FXint mode=REX_NORMAL,FXRexError* error=NULL)
Parse the given pattern by calling parse. If the parameter error is not NULL, the return code of the parse will be assigned to its contents; this code will be set to REGERR_OK if the parse succeeded, or some error code if it failed.
FXRex(const FXString& pattern,FXint mode=REX_NORMAL,FXRexError* error=NULL)
Same as above, only taking an FXString as argument.
FXRex& operator=(const FXRex& orig)
Assigns the compiled pattern of orig into this FXRex.
FXbool empty()
Returns TRUE if the pattern is empty, i.e. equal to the fallback pattern which matches nothing.
FXRexError parse(const FXchar* pattern,FXint mode=REX_NORMAL)
Parse the given pattern, returning REGERR_OK if successful, and an error code if a syntax error has been detected. The mode parameter is the bitwise OR of some flags which control various aspects of the regular expression code being generated:When parsing fails, or when REX_SYNTAX is passed, the regular expression object will be initialized to the fallback program; in other words, the regular expression will be empty. All matches attempted with the empty pattern will fail.
- REX_NORMAL. This is the default; the normal mode does not generate capturing parentheses; this corresponds to the most common use of regular expression matching.
- REX_CAPTURE. This flag enables the use of capturing parentheses, and back-references.
- REX_ICASE. Case insensitive matching is enabled. When backreferences are also enabled, through the REX_CAPTURE flag, they become insensitive to case as well, so that a pattern "(aa)bb\1" will match "aabbAA".
The REX_ICASE mode can be changed using cloistered mode expressions of the form "(?i ... )", which enables case-insensitive mode only for the given subexpression, and "(?I ... )", which enables case-sensitive mode for the subexpression.
- REX_NEWLINE. This will cause the any character type operations to also match newlines.
The expressions "(?n ... )" and (?N ... )" can be used to change the REX_NEWLINE mode for a subexpression.
- REX_VERBATIM. This flag turns off all the magic characters (including backslash escape sequences). The corresponding regular expression program therefore degenerates into a simple literal match. If REX_ICASE has also been passed, a literal case-insensitive match results.
The REX_VERBATIM flag is useful to allow building regular expression programs even when no special matching is required.
- REX_SYNTAX. When this flag is passed, the pattern is parsed normally, but no code is generated. This option is useful to test the regular expression pattern for syntactical correctness only.
FXRexError parse(const FXString& pattern,FXint mode=REX_NORMAL)
Same as above, only taking an FXString as argument.
FXbool match(const FXchar* string,FXint len,FXint* beg=NULL,FXint* end=NULL,FXint mode=REX_FORWARD,FXint npar=1,FXint fm=0,FXint to=2147483647)
Match the given subject string of length len, and returns TRUE if the pattern matches. if beg and end are not NULL, (beg[0],end[0]) will contain the offsets relative to the begin of string where the match started/ended, and (beg[i],end[i]) will contain the offsets where sub-expression i started/ended, or (-1,-1) if the corresponding subexpression was not matched. The search is performed in the range [fm,to].The mode parameter is a bitwise OR of some flags which control how the string is to be matched by the pattern:
- REX_FORWARD. This is the default. REX_FORWARD causes the matcher to scan the subject string starting from offset fm up to and including offset to.
- REX_BACKWARD. Scan the subject string, moving backwards starting from to down to offset fm. An important special case is when fm and to are the same; in this case, only a single attempt is made at the indicated offset. Observe also that while the scan is performed between fm and to, the actual subject string matched may extend past this range!
- REX_NOT_BOL. Normally, it is assumed that the start of the subject string is also the start of a line; however passing the flag REX_NOT_BOL suppresses this behavior. When passed, only positions immediately following a newline will match the "^" assertion.
- REX_NOT_EOL. This suppresses the interpretation of the end of the subject string as the end of a line; when passed, only positions preceding a newline will match the "$" assertion.
The flags REX_NOT_BOL and REX_NOT_EOL are useful when you need to match against a random chunk of text.
- REX_NOT_EMPTY. Disallow empty strings from matching. When passed, an empty string is NOT considered a match; for example, if the pattern:
a?b?is applied to a string not beginning with "a" or "b", it matches the empty string at the start of the subject. With the REX_NOT_EMPTY flag, this match is not valid, so FXRex searches further into the string for occurrences of "a" or "b". For example, when searching for pattern "a*" in "bbba", normally "" would be matched, as zero repetitions of "a" is normally possible. With REX_NOT_EMPTY, the single "a" will be matched instead.
This is usually what people expect to happen.The parameter npar controls the length of the beg and end arrays. It should be set to at least 1. If the pattern has been compiled with REX_CAPTURE, any capturing sub-expressions at a level less than npar will be returned into the arrays beg and end. Backreferences only work if capturing parentheses are enabled with REX_CAPTURE, although it is OK to pass npar=1 if you're not actually interested in the values of the captured parentheses: the backreferences and capturing parentheses use independent arrays.
To capture sub-expressions, is important to pass a value for npar which is at least as large as the expected number of capturing subexpressions, but the value of npar does not affect the matching process itself due to the use of independent arrays.The parameters fm and to furnish the matcher with the range of the text to be searched. Making fm and to equal will force the matcher to perform a single matching attempt at the given offset.
FXbool match(const FXString& string,FXint* beg=NULL,FXint* end=NULL,FXint mode=REX_FORWARD,FXint npar=1,FXint fm=0,FXint to=2147483647)
Same as above, only taking an FXString parameter.
static const FXchar* getError(FXRexError error)
Returns a pointer to a string containing a human-readable error message for the given error code.
FXbool operator==(const FXRex& r1,const FXRex& r2)
FXbool operator!=(const FXRex& r1,const FXRex& r2)
Compares regular expression r1 and r2. error code.
FXStream& operator<<(FXStream& store,const FXRex& s)
FXStream& operator>>(FXStream& store,FXRex& s)
Serialize and deserialize regular expression s to/from stream store.
Using the FXRex class
|
// A letter or underscore followed by a letters, digits, or underscores FXRex identifier("[a-zA-Z_][a-zA-Z0-9_]*"); FXString string; ... if(identifier.match(string)){ /* found it ... */ } |
The usage above is the simplest possible:- we just want to know if the pattern is contained in the string. If we need to execute a pattern only once, more crufty C++ implementor would probably write:
if(FXRex("[a-zA-Z_][a-zA-Z0-9_]*").match(string)){ /* found it ... */ } |
Its so nice to be able to do that in 1 line of code, isn't it?
Most of the time, we want to know more; not only whether there was a match or not,
but also where in the string the pattern was found:
// A letter or underscore followed by a letters, digits, or underscores FXRex identifier("[a-zA-Z_][a-zA-Z0-9_]*"); FXString string; FXint beg,end; ... if(identifier.match(string,&beg,&end)){ // Return the matching part return string.mid(beg,end-beg); } |
If we have enabled capturing parentheses, we can extract even more information; not only where the whole pattern matched, but also where each subpattern was found. The following code fragment which breaks up a floating point number into a sign,mantissa, and exponent using the pattern "(-|\+|)(\d*\.\d*)(E(-|\+|)\d+)?" illustrates the technique:
// Pick apart a floating point number // Note that backslashes must be escaped in C++! FXRex number("(-|\\+|)(\\d*\\.\\d*)(E(-|\\+|)\\d+)?",REX_CAPTURE); FXString string,sign,mantissa,exponent; FXint beg[5],end[5]; ... if(number.match(string,beg,end,REX_FORWARD,5)){ // Get the sign sign=string.mid(beg[1],end[1]-beg[1]); // Get the mantissa mantissa=string.mid(beg[2],end[2]-beg[2]); // Get the exponent exponent=string.mid(beg[3],end[3]-beg[3]); } |
Note that we're passing npar=5 as the last argument because there are 5 pairs
of values returned in the beg[] and end[] arrays: (beg[0],end[0]) contains the
entire pattern, (beg[1],end[1]) the sign, (beg[2],end[2]) the mantissa, (beg[3],end[3])
the exponent, and (beg[4],end[4]) the sign of the exponent.
If a particular sub-expression is not matched (for example, if there is no exponent),
then the corresponding entry will contain (-1,-1).
Because it needs to record where a sub-expression matches, its quite expensive
to use capturing parentheses when we don't need to. Fortunately, there's a
solution for this:
// Pick apart a floating point number FXRex number("(-|\\+|)(\\d*\\.\\d*)(E(?:-|\\+|)\\d+)?",REX_CAPTURE); FXString string,sign,mantissa,exponent; FXint beg[4],end[4]; ... if(number.match(string,beg,end,REX_FORWARD,4)){ // Get the sign sign=string.mid(beg[1],end[1]-beg[1]); // Get the mantissa mantissa=string.mid(beg[2],end[2]-beg[2]); // Get the exponent exponent=string.mid(beg[3],end[3]-beg[3]); } |
The syntax "(?: ... )" makes the enclosed expression a non-capturing one. Note that this also means we can make do with one less entry in the beg and end arrays, as we're getting one fewer return value.
If the npar parameter is too small, then any capturing parentheses level greater than npar will behave as before, but will simply not be returned. If back references are used, a back reference to a capturing parenthesis will work simply referencing the internal capturing arrays which are enabled by passing REX_CAPTURE when parsing the pattern.
The example below determines if there is a floating point number at offset pos in the string:
// Does string starting at pos match a floating point number? FXRex number("(-|\\+|)(\\d*\\.\\d*)(E(-|\\+|)\\d+)?",REX_NORMAL); FXString string; FXint pos; ... if(number.match(string,NULL,NULL,REX_FORWARD,1,pos,pos)){ ... // YES ... } |
Since both starting and ending offsets are the same, the matcher makes only a single attempt at matching the pattern at the offset pos; it does not scan the whole subject string.
Copyright © 1997-2022 Jeroen van der Zijp |