The Complete Regular Expression Guide - Usage
(Page 2 of 5 )
You're probably wondering why you should bother to learn regular expressions. If you're a normal computer user, the benefits are somewhat small. If you're a developer or a system administrator, however, you'll find that knowing regular expressions makes your (professional) life much easier.
Developers can use them to parse text files, fix up code and perform other wonders. System administrators can use them to search through logs, automate boring tasks and sniff the network traffic for unauthorized activity.
I would go so far as to say that it's a crime for a System Administrator not to have any knowledge of regular expressions. They really are that useful!
Quantifiers
Before I start explaining the syntax of a regular expression, you might want to jump to the second last page to learn which programs you can use to test out the examples in this article.
The contents of an expression are, as explained earlier, a combination of alphanumeric characters and meta-characters. An alphanumeric character is either a letter from the alphabet
abc
...or a number
123
In the world of regular expressions, any character that is not a meta-character matches itself (such characters are often called literal characters), although most of the time you are concerned with the alphanumeric ones. A very special character is the backslash \. It turns a meta-character into a literal character, and it turns some alphanumeric characters into a sort of meta-character or sequence. This is similar to using \ to escape a quote in a PHP string, or doubling a quote inside an ASP string.
The meta-characters are as follows:
\ | ( ) [ { ^ $ * + ? . < >
With that said, normal characters don't sound too interesting, so let's jump to our very first meta-characters.
The dot (or full stop) needs explaining first since it often leads to confusion. This character will not, as many might think, match a dot or the end of a sentence. It is instead a special meta-character which matches any character. Using it where you wanted to find the end of a sentence or the decimal point in a floating-point number will lead to strange results. As explained above, you need to escape the character with a backslash to get its literal representation.
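The difference between the bare dot and the escaped dot can be checked with a short sketch. It assumes Python's re module as the test tool, and the sample strings are made up for illustration. In Python the dot matches any character except a newline unless re.DOTALL is set.

```python
import re

# Hypothetical sample inputs, chosen only to illustrate the point.
samples = ["3.14", "3x14", "3-14"]

unescaped = re.compile(r"3.14")   # the dot stands for any single character
escaped = re.compile(r"3\.14")    # the backslash makes the dot a literal dot

# fullmatch() requires the whole string to fit the pattern.
unescaped_hits = [s for s in samples if unescaped.fullmatch(s)]
escaped_hits = [s for s in samples if escaped.fullmatch(s)]

print(unescaped_hits)  # ['3.14', '3x14', '3-14']
print(escaped_hits)    # ['3.14']
```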
For instance, take this expression:
1.23
It will match the number 1.23 in text, as you might have guessed, but it will also match these next lines:
1x23
1 23
1-23
To make the expression match only the floating-point number, we change it to
1\.23
Remember this, as it's very important. Now, with that said, we can get the show started. Two heavily recurring meta-characters are
* and +
They are called quantifiers, and they tell the engine to look for several occurrences of a character. A quantifier always follows the character it applies to. The * character matches zero or more occurrences of that character in a row, and the + character is similar but matches one or more.
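A minimal sketch of the two quantifiers attached to the letter b, assuming Python's re module as the test tool. The patterns ab*c and ab+c and the word list are made up for illustration.

```python
import re

# Hypothetical test words: zero, one and three b characters between a and c.
words = ["ac", "abc", "abbbc"]

star = re.compile(r"ab*c")  # a, then zero or more b, then c
plus = re.compile(r"ab+c")  # a, then one or more b, then c

star_hits = [w for w in words if star.fullmatch(w)]
plus_hits = [w for w in words if plus.fullmatch(w)]

print(star_hits)  # ['ac', 'abc', 'abbbc']
print(plus_hits)  # ['abc', 'abbbc']
```

Only the word with no b at all separates the two: * accepts it, + does not.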
So what if you decided to find words that have the character c in them? You might be tempted to write:
c*
What might come as a surprise is that you will find an enormous number of matches; even words with no c in them will match. How so, you ask? Well, the answer is simple. Recall that the * character matches zero or more occurrences of the character.
You see, in regular expressions it is possible to match what is called the empty string, which is simply a string of zero length. This empty string can actually be found in all text. For instance, the word:
go
...contains three empty strings. They are found at the position right before the g, in between the g and the o, and after the o. And an empty string contains exactly one empty string. At first this might seem like a really silly thing to do, but you'll learn later on how this is used in more complex expressions.
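The empty matches described above can be made visible with a small sketch, again assuming Python's re module. re.findall returns every non-overlapping match, including empty ones. The word cocoa is an extra sample added here for illustration.

```python
import re

# c* finds an empty match at each of the three positions in "go".
star_in_go = re.findall(r"c*", "go")
# The empty string itself holds exactly one empty match.
star_in_empty = re.findall(r"c*", "")
# c+ needs at least one c, so "go" yields nothing.
plus_in_go = re.findall(r"c+", "go")
# "cocoa" is a made-up sample with two separate c characters.
plus_in_cocoa = re.findall(r"c+", "cocoa")

print(star_in_go)     # ['', '', '']
print(star_in_empty)  # ['']
print(plus_in_go)     # []
print(plus_in_cocoa)  # ['c', 'c']
```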
So with this knowledge we might want to change our expression to:
c+
...and voila! We get only words with c in them. The next meta-character is the question mark:
?
This simply tells the engine to either match the character or not (zero or one occurrence). For instance, the expression:
cows?
...will match either of these lines:
cow
cows
These three meta-characters are simply special cases of the more general quantifier:
{n,m}
The n and m are respectively the minimum and maximum number of repetitions for the quantifier. For instance:
{1,5}
...means match at least one and at most five occurrences. You can also omit m to allow an unbounded match:
{1,}
...which matches one or more occurrences. This is exactly what the + character does. So now you see the connection: * is equal to {0,}, + is equal to {1,} and ? is equal to {0,1}. The last thing you can do with the quantifier is to skip the comma entirely:
{5}
This means that we want to match exactly 5 occurrences, no more, no less. Here are a couple of examples before we move on:
d{1,}
ca{1,}
m\.{0,5}ooo
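The equivalences between the shorthand quantifiers and the {n,m} form can be checked with a rough sketch, assuming Python's re module. The helper accepted() and the sample strings are hypothetical names and data introduced only for this illustration.

```python
import re

# Hypothetical sample strings used to compare the patterns.
samples = ["c", "ca", "caaa", "cow", "cows", "cowss"]

def accepted(pattern, candidates=samples):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Each shorthand paired with its {n,m} spelling.
pairs = [(r"ca*", r"ca{0,}"), (r"ca+", r"ca{1,}"), (r"cows?", r"cows{0,1}")]
results = {short: accepted(short) for short, _ in pairs}

for short, general in pairs:
    # Both spellings should accept exactly the same strings.
    assert accepted(short) == accepted(general)
    print(short, "==", general, "->", results[short])

# {5} means exactly five repetitions; {1,5} means one to five.
runs = ["a" * n for n in range(0, 7)]
exactly_five = accepted(r"a{5}", runs)
one_to_five = accepted(r"a{1,5}", runs)
print(exactly_five)      # ['aaaaa']
print(len(one_to_five))  # 5
```

fullmatch() is used so that each string must fit the pattern from start to end. A tool that searches inside lines would also report lines that merely contain a match.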
More By Jan Borsodi