Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
Regular Expressions - Creating Patterns
This section presents some simple techniques for writing your own regular expressions. I think of them as the push, the pull, and the composition. As in the Japanese martial art Judo, if your opponent is pushing against you, you pull him. If he's pulling away, you push. If those techniques don't work, you compose him into a pretzel.
Similarly, a regular expression problem will sometimes seem impervious to certain approaches but very susceptible to others. The methods I describe in the following sections are only simple techniques for writing patterns. If you haven't already done so, you'll soon cultivate your own bag of regex tricks. You may even develop pet names for them.
The Pull Technique
One of the most successful ways to create regular expressions consists of taking an exact match and then slowly morphing it into a generic regular expression that matches the original. I think of this as the pull technique, because I'm slowly pulling the regular expression out of the exact match.
For example, imagine that you want to create a pattern to match four-digit numbers. Thus, 1234 would be a match, but 123 would not, and neither would 12345 or ABCD.
For the sake of this example, you'll need to introduce a single regular expression metacharacter, \d, that will match any digit ranging from 0 to 9.
NOTE A metacharacter describes another, more complex character. For example, \n is a metacharacter describing the nonprintable newline character.
Going back to the example, you know that
1234 matches 1234
This is, of course, obvious: Anything will match itself. However, you also know that \d matches any digit. By the transitive property of logic, you can substitute \d for the last digit. Thus, the pattern becomes
1234 should_match 123\d
Here you replace the last digit, 4, with the equivalent metacharacter, \d. If you run this pattern through the handy RX.java program, you can see that it does, in fact, continue to match. So far, so good. Actually, it's better than good: Now you have a pattern that will match not only 1234, but also any four-digit number beginning with the digits 123. We're getting closer.
NOTE RX.java is a very short companion program for this book that you can obtain from the Downloads section of the Apress Web site (http://www.apress.com). You can use this program to execute regular expression patterns against a candidate string.
Repeat the process on the third digit, so that 1234 should match 12\d\d, where you replace the 3 with the equivalent \d. Things are looking up. Not only does this match 1234, but it also matches any four-digit number beginning with the digits 12.
You can see where this is going. Eventually, you'll create the pattern \d\d\d\d, which will match any four digits. This isn't the most succinct pattern, but it's sufficient to meet the stated need: It matches any four-digit number.
The point here is that you can, in principle, sometimes work backward from a specific match to create the pattern you need. Of course, this is just a technique, and it won't work for all situations. However, it's a good method to put into your regex bag of tricks.
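The pull steps can be checked with a few lines of Java. What follows is only a minimal sketch, not the book's RX.java program: the class name PullTechniqueDemo and the helper matchesWhole are made up for illustration. It assumes that a match means the whole candidate string must fit the pattern, which is what Pattern.matches does. Note that inside a Java string literal the backslash itself must be doubled, so \d is written "\\d".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it is not the RX.java companion program.
class PullTechniqueDemo {
    // Each step swaps one more literal digit for the \d metacharacter.
    // In Java source code, \d has to be written as "\\d".
    static final String[] STEPS = {
        "1234", "123\\d", "12\\d\\d", "1\\d\\d\\d", "\\d\\d\\d\\d"
    };

    // True only if the entire candidate matches the pattern.
    static boolean matchesWhole(String pattern, String candidate) {
        return Pattern.matches(pattern, candidate);
    }

    public static void main(String[] args) {
        // Every intermediate pattern should still match the original 1234.
        for (String step : STEPS) {
            System.out.println(step + " matches 1234: " + matchesWhole(step, "1234"));
        }
        // The final pattern accepts any four digits and nothing else.
        String finalPattern = STEPS[STEPS.length - 1];
        for (String candidate : new String[] {"5678", "123", "12345", "ABCD"}) {
            System.out.println(finalPattern + " matches " + candidate + ": "
                    + matchesWhole(finalPattern, candidate));
        }
    }
}
```

Because the whole string has to match, 123, 12345, and ABCD are all rejected by the final pattern, while 1234 and 5678 are accepted.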
The Push Technique
Another technique that I've found to be helpful in writing regular expression patterns is the push technique. The push technique builds on previous work by either adding to it, subtracting from it, or modifying its scope until it's useful.
Instead of working with a specific matching token, as in the pull technique, this approach takes a preexisting regular expression that's similar to the one you need and modifies it until it does the required job. That is, the regular expression is pushed into another functionality, hence the name.
For example, say you want a regex pattern that matches five digits. Based on the previous example, you know that \d\d\d\d will match any four digits. Thus, finding a pattern that matches five digits is as easy as appending another \d to the previous pattern. The answer, of course, is the pattern \d\d\d\d\d.
As you progress through this chapter, you'll learn that these aren't the most elegant representations of the four-digit and five-digit matching patterns you could come up with, but they're perfectly legitimate solutions, and they're reasonably derived. That process of derivation is the important point to take away from this discussion.
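Here is a small sketch of the push from four digits to five. The class and constant names are hypothetical. The last constant uses the {5} range quantifier, which this excerpt has not introduced yet; it is included only as an assumption about the more compact form the chapter alludes to.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the push technique.
class PushTechniqueDemo {
    // The pattern already derived with the pull technique.
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";
    // Pushed into a new job by appending one more \d.
    static final String FIVE_DIGITS = FOUR_DIGITS + "\\d";
    // Assumed compact equivalent using a range quantifier.
    static final String FIVE_DIGITS_COMPACT = "\\d{5}";

    // True only if the entire candidate consists of exactly five digits.
    static boolean isFiveDigits(String candidate) {
        return Pattern.matches(FIVE_DIGITS, candidate);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"12345", "1234", "123456", "1234A"}) {
            System.out.println(candidate + ": long form " + isFiveDigits(candidate)
                    + ", compact form " + Pattern.matches(FIVE_DIGITS_COMPACT, candidate));
        }
    }
}
```

Both forms should agree on every candidate; the long form simply makes the derivation visible.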
The Composition Technique
The composition technique does exactly what its name implies: It puts together various patterns to form a new whole. That is, it's the composition of a new pattern by using other patterns. This is distinct from the push technique in that patterns aren't modified; rather, they're simply appended.
Assume that you need to create a pattern that will match United States zip codes, which consist of five digits, followed by a hyphen character, followed by four digits. Based on the work you've already done, this pattern is very easy to create. You know that four digits match \d\d\d\d, that a hyphen matches itself, and that five digits match \d\d\d\d\d. Composing these into a single pattern yields the pattern \d\d\d\d\d-\d\d\d\d.
Again, this isn't the most elegant and concise representation for a zip code, and it isn't very permissive (what about five-digit zip codes? What if there are spaces between the hyphen and the digits? What if there is no hyphen, just a space?), but it does meet the stated requirement.
NOTE As with most software problems, you can often find the solution to a regex conundrum by clarifying the requirements.
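In Java, composition can be as plain as string concatenation. The sketch below assembles the zip code pattern from the stated requirement (five digits, a hyphen, then four digits). The class and constant names are invented for illustration, and whole-string matching with Pattern.matches is an assumption about how the pattern would be applied.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the composition technique.
class CompositionDemo {
    static final String FIVE_DIGITS = "\\d\\d\\d\\d\\d";
    static final String HYPHEN = "-";               // a hyphen matches itself here
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // The parts are appended unchanged to form the new whole.
    static final String ZIP_PLUS_FOUR = FIVE_DIGITS + HYPHEN + FOUR_DIGITS;

    // True only if the entire candidate has the form ddddd-dddd.
    static boolean isZipPlusFour(String candidate) {
        return Pattern.matches(ZIP_PLUS_FOUR, candidate);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"12345-6789", "12345", "12345 6789", "12345-67890"}) {
            System.out.println(candidate + ": " + isZipPlusFour(candidate));
        }
    }
}
```

As the text warns, this pattern is strict: a bare five-digit zip code or a space in place of the hyphen is rejected.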
Introducing the Regular Expression Syntax
The following sections introduce Java's regular expression syntax. For the sake of clarity, the material is grouped into small, logical units, followed by a brief example that demonstrates usage. The examples progress from those that emphasize the role of the Pattern to those that start to rely on the Matcher more.
NOTE Please keep in mind that these are working examples only. We're not ready to bulletproof our code yet.
The regex language contains metacharacters designed to help you describe search criteria. Because reading a pattern without being aware of these characters can be a bewildering experience, I've listed the most popular metacharacters in Table 1-1.
These characters are effectively reserved words, just as new is a reserved word in Java. They serve as building blocks for more complex search criteria. I discuss this in more detail soon.
Table 1-1. Basic Regex Delimiter Characters
Character    Description
.            Matches any character.
$            Matches the end of a line.
^            Matches the beginning of a line.
{            Defines a range opening.
[            Defines a character class opening.
(            Defines the beginning of a group.
|            A symbol meaning OR.
}            Defines a range closing.
]            Defines a character class closing.
)            Defines the closing of a group.
*            The preceding is repeated zero or more times.
+            The preceding is repeated one or more times.
?            The preceding is repeated zero or one time.
\            The following is not to be treated as a metacharacter.
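To see a few of the Table 1-1 metacharacters at work, here is a rough sketch. The class name and the two helpers are hypothetical, and the sample patterns are my own rather than the book's. The helper whole requires the entire input to match, while found looks for a match anywhere in the input, which is the more natural way to try ^ and $.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the sample patterns are illustrative only.
class MetacharacterDemo {
    // True if the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("c.t", "cat"));        // . stands for any character
        System.out.println(whole("cat|dog", "dog"));    // | means OR
        System.out.println(whole("[abc]", "b"));        // [ ] define a character class
        System.out.println(whole("(ab)*", "abab"));     // ( ) define a group, * repeats it
        System.out.println(whole("a{2}", "aa"));        // { } define a range
        System.out.println(whole("ab+", "abbb"));       // + means one or more
        System.out.println(whole("colou?r", "color"));  // ? means zero or one
        System.out.println(found("^hell", "hello"));    // ^ anchors at the beginning
        System.out.println(found("llo$", "hello"));     // $ anchors at the end
    }
}
```

Each line in main should print true; changing an input (for example, "ct" for the first pattern) is a quick way to watch a match fail.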
If you're reading a character in a regex pattern and it isn't one of the characters listed in Table 1-1, then the character you're reading probably stands for the character it represents. For example, Table 1-2 shows how the pattern hello* should be read.
Table 1-2. The Pattern hello*
h    The character h
e    Followed by the character e
l    Followed by the character l
l    Followed by the character l
o    Followed by the character o
*    Followed by a metacharacter that, in this case, means o should be repeated zero or more times
In English: Look for the word hell, followed by any number of trailing o characters.
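That reading can be tried out directly. This is a minimal sketch with an invented class name, assuming whole-string matching through Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for reading the pattern hello*.
class HelloStarDemo {
    // h, e, l, l, then zero or more o characters.
    static final String HELLO_STAR = "hello*";

    // True only if the entire candidate matches hello*.
    static boolean matchesHelloStar(String candidate) {
        return Pattern.matches(HELLO_STAR, candidate);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"hell", "hello", "hellooo", "hel", "help"}) {
            System.out.println(candidate + ": " + matchesHelloStar(candidate));
        }
    }
}
```

Here hell, hello, and hellooo match, because the * applies only to the o that precedes it. Neither hel nor help matches, since the literal characters hell must all be present and only o characters may follow.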
If you actually need to find one of these characters, such as the * character, simply precede the character you're searching for with a \ character. For example, to find the * character, use \*.
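One wrinkle in Java source code: the backslash is also the escape character for string literals, so the regex \* has to be typed as "\\*". The sketch below illustrates this with a made-up class name. It also shows Pattern.quote, which is available on Java 5 and later (an assumption about your runtime) and escapes an entire literal for you.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for escaping a metacharacter.
class EscapeDemo {
    // The regex hello\* written as a Java string literal.
    static final String LITERAL_STAR = "hello\\*";

    // Same idea, letting Pattern.quote do the escaping (Java 5+).
    static final String QUOTED_STAR = "hello" + Pattern.quote("*");

    // True only if the candidate is exactly hello followed by a literal *.
    static boolean endsWithLiteralStar(String candidate) {
        return Pattern.matches(LITERAL_STAR, candidate);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"hello*", "hello", "hellooo"}) {
            System.out.println(candidate + ": escaped " + endsWithLiteralStar(candidate)
                    + ", quoted " + Pattern.matches(QUOTED_STAR, candidate));
        }
    }
}
```

With the escape in place, only the string hello* matches; hello and hellooo no longer do, because the * is now an ordinary character rather than a repetition operator.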