The Complete Regular Expression Guide - Assertions
The next type of meta-character is the assertion. An assertion matches when a condition about the current position in the text is true. The first pair of assertions are
^ and $
...which match the beginning of the line and the end of the line. Note that some regular expression implementations allow you to change their behavior so that they instead match the beginning of the text and the end of the text. These assertions always match a zero-length string; in other words, they match a position. For instance, if you wrote this expression:
^The
...then it would match any line that begins with the characters "The", including lines that start with "Then" or "There". The next pair of assertions match at the beginning and end of a word. They are:
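\< and \>
They come in handy when you want to match a word precisely. (They are written with a leading backslash in implementations that support them, such as GNU grep; some other implementations use \b for both positions.) For instance:
\<cow\>
...matches only the whole word cow, while \<cow on its own, which asserts just the start of a word, would match at the beginning of any of the following words: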
cow
coward
cowage
cowboy
cowl
One last thing to be said is that all literal characters are in fact assertions themselves. The difference between those and the ones above is that literal characters have a size: they consume a character of the text. So for clarity's sake, we only use the word assertion for those that are zero-width.
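To make the assertions above a little more concrete, here is a minimal sketch using Python's re module. Two assumptions to keep in mind: the article itself is not tied to Python, and Python's re does not support \< and \>, so the sketch substitutes \b, which matches at either edge of a word. The sample text is made up for illustration.

```python
import re

# Made-up sample text with three lines.
text = "The cow ran.\nThen the coward hid.\nA cowboy saw The cowl."

# By default, ^ matches only at the very start of the text.
start_of_text = re.findall(r"^The", text)

# With re.MULTILINE, ^ matches at the start of every line.
# Note that the second line ("Then ...") also begins with "The".
start_of_lines = re.findall(r"^The", text, flags=re.MULTILINE)

# A word boundary on both sides: only the whole word "cow".
whole_word = re.findall(r"\bcow\b", text)

# A word boundary on the left only: any word that starts with "cow".
word_start = re.findall(r"\bcow\w*", text)

# Assertions are zero-width: matching ^ alone yields an empty string.
zero_width = re.search(r"^", text).group()

print(start_of_text, start_of_lines, whole_word, word_start, repr(zero_width))
```

The re.MULTILINE flag is Python's way of switching ^ and $ between the "start of text" and "start of line" behaviors mentioned earlier; other implementations may use a different default or a different switch.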
Groups and Alternation
One thing that you might have noticed when we explained quantifiers is that they only worked on the single character to their left. Since this limits our expressions considerably, I'll explain other uses for quantifiers. Quantifiers can also be used on meta-characters. Using them on assertions is pointless, since assertions are zero-width and matching one, two, three or more of them at the same position gains us nothing. However, the grouping and sequence meta-characters are perfect for being quantified. Let's start with grouping.
You can form groups (or sub expressions, as they are frequently called) by using the opening and closing parenthesis characters:
( and )
The ( character starts the sub expression and the ) character ends it. It is also possible to nest one or more sub expressions inside a sub expression. The sub expression matches if its contents match. So, mixing this with quantifiers, you can do something like this:
( ?ho)+
...which matches all of the following lines:
ho
ho ho
ho ho ho
hohoho
Another use for sub expressions is to extract the portion of the text that a particular sub expression matched. This is often used in conjunction with sequences, which are discussed later.
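As a rough sketch of quantified groups and of extracting what a group matched, the following uses Python's re module (an assumption; the article does not name a language). The "Donald Duck" sample string and the variable names are illustrative only.

```python
import re

# The quantifier + applies to the whole group ( ?ho), not just to "o".
pattern = re.compile(r"( ?ho)+")

lines = ["ho", "ho ho", "ho ho ho", "hohoho"]
# fullmatch succeeds only if the pattern consumes the entire line.
all_match = all(pattern.fullmatch(line) for line in lines)

# Without a group, + applies only to the preceding character:
# "ho+" means "h" followed by one or more "o", so it cannot cover "hoho".
ungrouped = re.fullmatch(r"ho+", "hoho")

# Groups also capture: group(n) returns the text matched by the n-th "(".
m = re.search(r"(Donald) (Duck)", "Here comes Donald Duck")
first, last = m.group(1), m.group(2)

print(all_match, ungrouped, first, last)
```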
Next up are alternations, which allow you to match on more than one pattern. The alternation character looks like this:
|
Here's how we can use it:
Bill|Linus|Steve|Larry
The regular expression above would match either Bill, Linus, Steve or Larry. Mixing this with sub expressions and quantifiers, we can do something like this:
cow(ard|age|boy|l)?
...which, when required to match a whole word, matches any of the following words but no other combinations:
cow
coward
cowage
cowboy
cowl
I mentioned earlier in the article that not every part of an expression has to match for the overall match to be successful. This can happen when you're using sub expressions together with alternations. For example:
((Donald|Dolly) Duck)|(Scrooge McDuck)
As you can see, only the left or the right top-level sub expression will match, not both. This is sometimes handy when you want to run a complex pattern in one sub expression and, if it fails, try another one.
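The two alternation examples above can be tried out with a short sketch like this one, again assuming Python's re module. The extra candidate words "cows" and "cowardly" are my own additions, included to show what gets rejected.

```python
import re

cow_words = re.compile(r"cow(ard|age|boy|l)?")

candidates = ["cow", "coward", "cowage", "cowboy", "cowl", "cows", "cowardly"]
# fullmatch requires the whole word to match, so "cows" and "cowardly" fail.
accepted = [w for w in candidates if cow_words.fullmatch(w)]

# Only one top-level branch takes part in any given match; the groups
# belonging to the other branch are reported as None.
ducks = re.compile(r"((Donald|Dolly) Duck)|(Scrooge McDuck)")
left = ducks.search("Dolly Duck").groups()
right = ducks.search("Scrooge McDuck").groups()

print(accepted)
print(left)
print(right)
```

Checking which groups came back as None is one way to tell which branch of the alternation was taken.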
Sequences
Lastly, we have sequences (often called character classes elsewhere), which define a set of characters, any one of which can match at a given position. Sometimes you don't want to match a word directly but would rather match something that resembles one. The sequence characters are
[ and ]
Any characters put inside the sequence brackets are treated as literal characters, even meta-characters. The only special characters are the -, which denotes character ranges, and the ^, which negates the sequence when it is the first character inside the brackets (i.e. the sequence then matches any character that is not listed). Many implementations also treat \ and ] specially inside the brackets.
For example,
[a-z]
...will match any single lowercase letter of the English alphabet (a to z). Another common sequence is
[a-zA-Z0-9]
...which matches any single lowercase or capital letter of the English alphabet, as well as any digit. Sequences can also be mixed with quantifiers and assertions to produce more elaborate searches. For example,
\<[a-zA-Z]+\>
...matches all whole words made up only of letters. This will match
cow
Linus
regular
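expression
...but will not match any of the following as a whole (it would still find the letter-only words x, files and C inside them):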
200
x-files
C++
Now, what if you wanted to find anything but complete words? The expression
[^a-zA-Z0-9]+
...would find any run of one or more characters that are not alphanumeric (remember that a ^ at the start of a sequence negates it, so the sequence matches any character that is not listed).
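Here is a small sketch of the sequences above, assuming Python's re module and a made-up sample sentence. Since Python lacks the \< and \> word assertions, \b is used in their place.

```python
import re

text = "Linus watched 200 x-files while writing C++"

# Runs of lowercase letters only; the capital "L" of "Linus" is skipped.
lowercase_runs = re.findall(r"[a-z]+", text)

# Whole words made only of letters. "200" is not found at all, while
# "x-files" and "C++" only contribute their letter-only parts.
whole_words = re.findall(r"\b[a-zA-Z]+\b", text)

# A leading ^ negates the sequence: runs of non-alphanumeric characters.
non_alnum = re.findall(r"[^a-zA-Z0-9]+", text)

# Meta-characters lose their special meaning inside the brackets.
literal_meta = re.findall(r"[+*.]", "C++ v1.0")

print(lowercase_runs)
print(whole_words)
print(non_alnum)
print(literal_meta)
```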
Some implementations of regular expressions allow you to use shorthand versions for commonly used sequences. They are:
\d, a digit [0-9]
\D, a non-digit [^0-9]
\w, a word character (alphanumeric or underscore) [a-zA-Z0-9_]
\W, a non-word character [^a-zA-Z0-9_]
\s, a whitespace [ \t\n\r\f]
\S, a non-whitespace [^ \t\n\r\f]
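The shorthand sequences can be checked against their bracketed equivalents with a sketch like the following, assuming Python's re module. In Python 3 the shorthands also match non-ASCII characters by default, so the comparison passes the re.ASCII flag to restrict them to the ranges listed above. The sample string is made up.

```python
import re

# Made-up sample containing digits, an underscore, spaces and a tab.
sample = "Room 42, floor_3\tis open"

digits = re.findall(r"\d+", sample)
words = re.findall(r"\w+", sample)   # note that "_" counts as a word character
spaces = re.findall(r"\s", sample)

# With re.ASCII, \w behaves like the explicit sequence [a-zA-Z0-9_].
same_as_brackets = (
    re.findall(r"\w+", sample, flags=re.ASCII)
    == re.findall(r"[a-zA-Z0-9_]+", sample)
)

print(digits, words, spaces, same_as_brackets)
```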
More By Jan Borsodi