
Regular Expressions


Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 28, 2005
TABLE OF CONTENTS:
  1. Regular Expressions
  2. Creating Patterns
  3. Common and Boundary Characters
  4. Character Classes
  5. Back References
  6. Integrating Java with Regular Expressions
  7. Confirming Name Formats Example
  8. Finding Duplicate Words Example
  9. Regular Expression Operations
  10. Search and Replace
  11. Comparing Regex and Perl


Regular Expressions - Creating Patterns
(Page 2 of 11)

This section presents some simple techniques for writing your own regular expressions. I think of them as the push, the pull, and the composition. As in the Japanese martial art Judo, if your opponent is pushing against you, you pull him. If he's pulling away, you push. If those techniques don't work, you compose him into a pretzel.

Similarly, a regular expression problem will sometimes seem impervious to certain approaches but very susceptible to others. The methods I describe in the following sections are only simple techniques for writing patterns. If you haven't already done so, you'll soon cultivate your own bag of regex tricks. You may even develop pet names for them.

The Pull Technique

One of the most successful ways to create regular expressions consists of taking an exact match and then slowly morphing it into a generic regular expression that matches the original. I think of this as the pull technique, because I'm slowly pulling the regular expression out of the exact match.

For example, imagine that you want to create a pattern to match four-digit numbers. Thus, 1234 would be a match, but 123 would not, and neither would 12345 or ABCD.

For the sake of this example, I need to introduce a single regular expression metacharacter, \d, which matches any digit from 0 to 9.

NOTE  A metacharacter describes another, more complex character. For example, \n is a metacharacter describing the nonprintable newline character.

Going back to the example, you know that

1234 matches 1234

This is, of course, obvious: Anything will match itself. However, you also know that \d matches any digit. By the transitive property of logic, you can substitute \d for the last digit. Thus, the pattern becomes

1234 should_match 123\d

Here you replace the last digit, 4, with the equivalent metacharacter, \d. If you run this pattern through the handy RX.java program, you can see that it does, in fact, continue to match. So far, so good. Actually, it's better than good: Now you have a pattern that will match not only 1234, but also any four-digit number beginning with the digits 123. We're getting closer.

NOTE   RX.java is a very short companion program for this book that you can obtain from the Downloads section of the Apress Web site (http://www.apress.com). You can use this program to execute regular expression patterns against a candidate string.

Repeat the process on the third digit, so that 1234 should match 12\d\d, where you replace the 3 with the equivalent \d. Things are looking up. Not only does this match 1234, but it also matches any four-digit number beginning with the digits 12.

You can see where this is going. Eventually, you'll create the pattern \d\d\d\d, which will match any four digits. This isn't the most succinct pattern, but it's sufficient to meet the stated need: It matches any four-digit number.

The point here is that you can, in principle, sometimes work backward from a specific match to create the pattern you need. Of course, this is just a technique, and it won't work for all situations. However, it's a good method to put into your regex bag of tricks.
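The steps of the pull technique can be checked with a few lines of Java. This is only a minimal sketch, not the RX.java program from the book: the class name PullTechniqueDemo and the helper matchesFully are hypothetical, and the sketch assumes that "match" means the whole candidate string must match, which is what Pattern.matches tests. Note that inside a Java string literal the backslash must be doubled, so the regex \d is written "\\d".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it is not the RX.java program from the book.
class PullTechniqueDemo {

    // Returns true only if the entire candidate string matches the regex.
    static boolean matchesFully(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        // Each step replaces one more literal digit with \d.
        String[] steps = { "1234", "123\\d", "12\\d\\d", "1\\d\\d\\d", "\\d\\d\\d\\d" };
        for (String regex : steps) {
            System.out.println(regex + " matches 1234: " + matchesFully(regex, "1234"));
        }

        // The final pattern accepts any four-digit number and rejects the rest.
        String fourDigits = "\\d\\d\\d\\d";
        for (String candidate : new String[] { "5678", "123", "12345", "ABCD" }) {
            System.out.println(candidate + ": " + matchesFully(fourDigits, candidate));
        }
    }
}
```

Every intermediate pattern still matches 1234, which is the property the technique relies on at each step.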

The Push Technique

Another technique that I've found to be helpful in writing regular expression patterns is the push technique. The push technique builds on previous work by either adding to it, subtracting from it, or modifying its scope until it's useful.

Instead of working with a specific matching token, as in the pull technique, this approach takes a preexisting regular expression that's similar to the one you need and modifies it until it does the required job. That is, the regular expression is pushed into another functionality, hence the name.

For example, say you want a regex pattern that matches five digits. Based on the previous example, you know that \d\d\d\d will match any four digits. Thus, finding a pattern for a five-digit match is as easy as appending another \d to the previous pattern. The answer, of course, is the pattern \d\d\d\d\d.

As you progress through this chapter, you'll learn that these aren't the most elegant representations of the four-digit and five-digit matching patterns you could come up with, but they're perfectly legitimate solutions, and they're reasonably derived. That process of derivation is the important point to take away from this discussion.
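In Java, the push from four digits to five might look like the following sketch. The class name PushTechniqueDemo and its constants are hypothetical names chosen for illustration, and the sketch again assumes whole-string matching through Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the push technique.
class PushTechniqueDemo {

    // The existing pattern: any four digits (backslashes doubled for Java).
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // The existing pattern pushed one digit further.
    static final String FIVE_DIGITS = FOUR_DIGITS + "\\d";

    // Returns true only if the entire candidate consists of five digits.
    static boolean isFiveDigits(String candidate) {
        return Pattern.matches(FIVE_DIGITS, candidate);
    }

    public static void main(String[] args) {
        System.out.println(isFiveDigits("12345")); // five digits
        System.out.println(isFiveDigits("1234"));  // one digit short
        System.out.println(isFiveDigits("123456")); // one digit too many
    }
}
```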

The Composition Technique

The composition technique does exactly what its name implies: It puts together various patterns to form a new whole. That is, it's the composition of a new pattern by using other patterns. This is distinct from the push technique in that patterns aren't modified; rather, they're simply appended.

Assume that you need to create a pattern that will match United States zip codes, which consist of five digits, followed by a hyphen character, followed by four digits. Based on the work you've already done, this pattern is very easy to create. You know that \d\d\d\d\d matches five digits, that a hyphen matches itself, and that \d\d\d\d matches four digits. Composing these into a single pattern yields the pattern \d\d\d\d\d-\d\d\d\d.

Again, this isn't the most elegant and concise representation for a zip code, and it isn't very permissive (what about five-digit zip codes? What if there are spaces between the hyphen and the digits? What if there is no hyphen, just a space?), but it does meet the stated requirement.
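The composition can be expressed quite literally in Java by concatenating the pieces. The following is a sketch under the stated requirement only (five digits, a hyphen, four digits); the class name ZipCodeDemo and its members are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the composition technique.
class ZipCodeDemo {

    static final String FIVE_DIGITS = "\\d\\d\\d\\d\\d";
    static final String HYPHEN = "-"; // a hyphen outside a character class matches itself
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // The composed pattern: the existing pieces are appended, not modified.
    static final Pattern ZIP_PLUS_FOUR = Pattern.compile(FIVE_DIGITS + HYPHEN + FOUR_DIGITS);

    // Returns true only if the entire candidate has the form ddddd-dddd.
    static boolean isZipPlusFour(String candidate) {
        return ZIP_PLUS_FOUR.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isZipPlusFour("12345-6789")); // meets the stated requirement
        System.out.println(isZipPlusFour("12345"));      // five-digit zip only
        System.out.println(isZipPlusFour("12345 6789")); // space instead of hyphen
    }
}
```

The last two candidates are rejected, which illustrates how strictly the composed pattern follows the stated requirement.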

NOTE   As with most software problems, you can often find the solution to a regex conundrum by clarifying the requirements.

Introducing the Regular Expression Syntax

The following sections introduce Java's regular expression syntax. For the sake of clarity, the material is grouped into small, logical units, followed by a brief example that demonstrates usage. The examples progress from those that emphasize the role of the Pattern to those that start to rely on the Matcher more.

NOTE   Please keep in mind that these are working examples only. We're not ready to bulletproof our code yet.

Reading Patterns

The regex language contains metacharacters designed to help you describe search criteria. Because reading a pattern without being aware of these characters can be a bewildering experience, I've listed the most popular metacharacters in Table 1-1.

These characters are effectively reserved words, just as new is a reserved word in Java. They serve as building blocks for more complex search criteria. I discuss this in more detail soon.

Table 1-1. Basic Regex Delimiter Characters

Pattern   Name                    Description
.         Period                  Matches any character.
$         Dollar sign             Matches the end of a line.
^         Caret                   Matches the beginning of a line.
{         Opening curly bracket   Defines a range opening.
[         Opening bracket         Defines a character class opening.
(         Opening parenthesis     Defines the beginning of a group.
|         Pipe symbol             A symbol meaning OR.
}         Closing curly bracket   Defines a range closing.
]         Closing bracket         Defines a character class closing.
)         Closing parenthesis     Defines the closing of a group.
*         Asterisk                The preceding is repeated zero or more times.
+         Plus sign               The preceding is repeated one or more times.
?         Question mark           The preceding is repeated zero or one time.
\         Backslash               The following is not to be treated as a metacharacter.
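To get a feel for these characters before they are covered in detail, here is a small, hedged sketch that exercises several of them. The class name MetacharacterDemo, the helper methods, and the sample strings are all hypothetical choices made for illustration; full(...) requires the whole input to match, while found(...) looks for a match anywhere in the input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class exercising some of the metacharacters in the table.
class MetacharacterDemo {

    // True only if the entire input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(full("h.t", "hot"));         // . stands in for the o
        System.out.println(full("colou?r", "color"));   // ? makes the u optional
        System.out.println(full("ab+", "abbb"));        // + needs at least one b
        System.out.println(full("cat|dog", "dog"));     // | means OR
        System.out.println(full("[bc]at", "cat"));      // [ ] is a character class
        System.out.println(full("(ab)+", "abab"));      // ( ) groups ab for the +
        System.out.println(full("\\d{4}", "1234"));     // { } gives a repetition count
        System.out.println(found("^abc", "abcdef"));    // ^ anchors at the beginning
        System.out.println(found("def$", "abcdef"));    // $ anchors at the end
    }
}
```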

If you're reading a character in a regex pattern and it isn't one of the characters listed in Table 1-1, then the character you're reading probably stands for the character it represents. For example, Table 1-2 shows how the pattern hello* should be read.

Table 1-2. The Pattern hello*

Letter   Description
h        The character h
e        Followed by the character e
l        Followed by the character l
l        Followed by the character l
o        Followed by the character o
*        Followed by a metacharacter that, in this case, means o should be repeated zero or more times

* In English: Look for the word hell, followed by any number of trailing o characters.
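The reading of hello* can be confirmed with a short sketch. The class name HelloStarDemo and the helper matchesFully are hypothetical, and the sketch assumes whole-string matching via Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the pattern hello*.
class HelloStarDemo {

    // True only if the entire candidate matches hello*.
    static boolean matchesFully(String candidate) {
        return Pattern.matches("hello*", candidate);
    }

    public static void main(String[] args) {
        System.out.println(matchesFully("hell"));     // zero trailing o characters
        System.out.println(matchesFully("hello"));    // one trailing o
        System.out.println(matchesFully("helloooo")); // several trailing o characters
        System.out.println(matchesFully("hel"));      // the word hell is incomplete
    }
}
```

Only the last candidate fails, because the * applies to the o alone and the four letters of hell remain mandatory.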

If you actually need to find one of these characters, such as the * character, simply prefix the character you're searching for with a \ character. For example, to find the * character, use \*.
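In Java source code the escape needs one extra step, because the backslash is also special inside string literals: the regex \* is written "\\*". The sketch below shows the difference between the escaped and unescaped asterisk. The class name EscapeDemo and the sample strings are hypothetical; Pattern.quote is a standard java.util.regex method that escapes a whole literal for you, shown here as an alternative rather than as the approach the text describes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing how to match a literal * character.
class EscapeDemo {

    // True only if the entire candidate matches the regex.
    static boolean matchesFully(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        // Escaped: the regex 2\*3 looks for the literal text 2*3.
        System.out.println(matchesFully("2\\*3", "2*3"));

        // Unescaped: the regex 2*3 means zero or more 2s followed by a 3.
        System.out.println(matchesFully("2*3", "2*3"));
        System.out.println(matchesFully("2*3", "2223"));

        // Alternative: let Pattern.quote escape the whole literal.
        System.out.println(matchesFully(Pattern.quote("2*3"), "2*3"));
    }
}
```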

