Regular Expressions
By: Apress Publishing
    2005-07-28

    Table of Contents:
  • Regular Expressions
  • Creating Patterns
  • Common and Boundary Characters
  • Character Classes
  • Back References
  • Integrating Java with Regular Expressions
  • Confirming Name Formats Example
  • Finding Duplicate Words Example
  • Regular Expression Operations
  • Search and Replace
  • Comparing Regex and Perl


    Regular Expressions - Creating Patterns



    This section presents some simple techniques for writing your own regular expressions. I think of them as the push, the pull, and the composition. As in the Japanese martial art Judo, if your opponent is pushing against you, you pull him. If he’s pulling away, you push. If those techniques don’t work, you compose him into a pretzel.

    Similarly, a regular expression problem will sometimes seem impervious to certain approaches, but very susceptible to others. The methods I describe in the following sections are only simple techniques for writing patterns. If you haven’t already done so, you’ll soon cultivate your own bag of regex tricks. You may even develop pet names for them.

    The Pull Technique

    One of the most successful ways to create regular expressions consists of taking an exact match and then slowly morphing it into a generic regular expression that matches the original. I think of this as the pull technique, because I’m slowly pulling the regular expression out of the exact match.

    For example, imagine that you want to create a pattern to match four-digit numbers. Thus, 1234 would be a match, but 123 would not, and neither would 12345 or ABCD.

    For the sake of this example, you’ll need to introduce a single regular expression metacharacter, \d, that will match any digit ranging from 0 to 9.

    NOTE  A metacharacter describes another, more complex character. For example, \n is a metacharacter describing the nonprintable newline character.

    Going back to the example, you know that

    1234 matches 1234

    This is, of course, obvious: Anything will match itself. However, you also know that \d matches any digit. By the transitive property of logic, you can substitute \d for the last digit. Thus, the pattern becomes

    1234 should_match 123\d

    Here you replace the last digit, 4, with the equivalent metacharacter, \d. If you run this pattern through the handy RX.java program, you can see that it does, in fact, continue to match. So far, so good. Actually, it’s better than good: Now you have a pattern that will match not only 1234, but also any four-digit number beginning with the digits 123. We’re getting closer.

    NOTE   RX.java is a very short companion program for this book that you can obtain from the Downloads section of the Apress Web site (http://www.apress.com). You can use this program to execute regular expression patterns against a candidate string.

    Repeat the process on the third digit, so that 1234 should match 12\d\d, where you replace the 3 with the equivalent \d. Things are looking up. Not only does this match 1234, but it also matches any four-digit number beginning with the digits 12.

    You can see where this is going. Eventually, you’ll create the pattern \d\d\d\d, which will match any four digits. This isn’t the most succinct pattern, but it’s sufficient to meet the stated need: It matches any four-digit number.

    The point here is that you can, in principle, sometimes work backward from a specific match to create the pattern you need. Of course, this is just a technique, and it won’t work for all situations. However, it’s a good method to put into your regex bag of tricks.
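    The pull steps above can be sketched in Java as follows. This is only an illustration, not the book's RX.java program: the class name PullTechniqueDemo and the helper fullyMatches are hypothetical, and the sketch assumes whole-string matching through Pattern.matches, so that 12345 is rejected rather than matched on its first four digits. Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it is not the RX.java companion program.
class PullTechniqueDemo {

    // True only when the entire candidate string matches the regex.
    static boolean fullyMatches(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        // Each step replaces one more literal digit with \d.
        String[] steps = {
            "1234", "123\\d", "12\\d\\d", "1\\d\\d\\d", "\\d\\d\\d\\d"
        };
        for (String regex : steps) {
            System.out.println(regex + " matches 1234: "
                    + fullyMatches(regex, "1234"));
        }

        // Check the final pattern against the candidates from the text.
        String fourDigits = "\\d\\d\\d\\d";
        for (String candidate : new String[] {"1234", "123", "12345", "ABCD"}) {
            System.out.println(candidate + ": "
                    + fullyMatches(fourDigits, candidate));
        }
    }
}
```

    Every intermediate pattern still accepts 1234, which is what makes it safe to take the next step.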

    The Push Technique

    Another technique that I’ve found to be helpful in writing regular expression patterns is the push technique. The push technique builds on previous work by either adding to it, subtracting from it, or modifying its scope until it’s useful.

    Instead of working with a specific matching token, as in the pull technique, this approach takes a preexisting regular expression that’s similar to the one you need and modifies it until it does the required job. That is, the regular expression is pushed into another functionality, hence the name.

    For example, say you want a regex pattern that matches five digits. Based on the previous example, you know that \d\d\d\d will match any four digits. Thus, finding a pattern for a five-digit match is as easy as appending another \d to the previous pattern. The answer, of course, is the pattern \d\d\d\d\d.

    As you progress through this chapter, you’ll learn that these aren’t the most elegant representations of the four-digit and five-digit matching patterns you could come up with, but they’re perfectly legitimate solutions, and they’re reasonably derived. That process of derivation is the important point to take away from this discussion.
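    As a rough Java sketch of the push technique, the five-digit pattern can be derived from the existing four-digit one by appending a single \d. The class and constant names here are hypothetical, and whole-string matching through Pattern.matches is an assumption of this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the push technique.
class PushTechniqueDemo {

    // The pattern you already have.
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // Pushed into a new role by appending one more \d.
    static final String FIVE_DIGITS = FOUR_DIGITS + "\\d";

    // True only when the whole candidate is exactly five digits.
    static boolean isFiveDigits(String candidate) {
        return Pattern.matches(FIVE_DIGITS, candidate);
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"12345", "1234", "123456"}) {
            System.out.println(candidate + ": " + isFiveDigits(candidate));
        }
    }
}
```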

    The Composition Technique

    The composition technique does exactly what its name implies: It puts together various patterns to form a new whole. That is, it’s the composition of a new pattern by using other patterns. This is distinct from the push technique in that patterns aren’t modified; rather, they’re simply appended.

    Assume that you need to create a pattern that will match United States zip codes, which consist of five digits, followed by a hyphen character, followed by four digits. Based on the work you’ve already done, this pattern is very easy to create. You know that \d\d\d\d\d matches five digits, that a hyphen matches itself, and that \d\d\d\d matches four digits. Composing these into a single pattern yields \d\d\d\d\d-\d\d\d\d.

    Again, this isn’t the most elegant and concise representation for a zip code, and it isn’t very permissive (What about five-digit zip codes? What if there are spaces between the hyphen and the digits? What if there is no hyphen, just a space?), but it does meet the stated requirement.
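    The composition can be sketched in Java by concatenating the pieces unchanged: five digits, a literal hyphen, then four digits, as the stated requirement describes. The class, constant, and method names are hypothetical, and the sample zip codes are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the composition technique.
class CompositionDemo {

    static final String FIVE_DIGITS = "\\d\\d\\d\\d\\d";
    static final String HYPHEN = "-";              // a hyphen matches itself here
    static final String FOUR_DIGITS = "\\d\\d\\d\\d";

    // The parts are appended as they are; none of them is modified.
    static final String ZIP_PLUS_FOUR = FIVE_DIGITS + HYPHEN + FOUR_DIGITS;

    // True only when the whole candidate has the five-hyphen-four shape.
    static boolean isZipPlusFour(String candidate) {
        return Pattern.matches(ZIP_PLUS_FOUR, candidate);
    }

    public static void main(String[] args) {
        String[] candidates = {"12345-6789", "12345", "12345 6789", "12345- 6789"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + isZipPlusFour(candidate));
        }
    }
}
```

    Only the first candidate is accepted; the others are exactly the less permissive cases raised above.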

    NOTE   As with most software problems, you can often find the solution to a regex conundrum by clarifying the requirements.

    Introducing the Regular Expression Syntax

    The following sections introduce Java’s regular expression syntax. For the sake of clarity, the material is grouped into small, logical units, followed by a brief example that demonstrates usage. The examples progress from those that emphasize the role of the Pattern to those that start to rely on the Matcher more.

    NOTE   Please keep in mind that these are working examples only. We’re not ready to bulletproof our code yet.

    Reading Patterns

    The regex language contains metacharacters designed to help you describe search criteria. Because reading a pattern without being aware of these characters can be a bewildering experience, I’ve listed the most popular metacharacters in Table 1-1.

    These characters are effectively reserved words, just as new is a reserved word in Java. They serve as building blocks for more complex search criteria. I discuss this in more detail soon.

    Table 1-1. Basic Regex Delimiter Characters

    Pattern   Name                    Description
    .         Period                  Matches any character.
    $         Dollar sign             Matches the end of a line.
    ^         Caret                   Matches the beginning of a line.
    {         Opening curly bracket   Defines a range opening.
    [         Opening bracket         Defines a character class opening.
    (         Opening parenthesis     Defines the beginning of a group.
    |         Pipe symbol             A symbol meaning OR.
    }         Closing curly bracket   Defines a range closing.
    ]         Closing bracket         Defines a character class closing.
    )         Closing parenthesis     Defines the closing of a group.
    *         Asterisk                The preceding is repeated zero or more times.
    +         Plus sign               The preceding is repeated one or more times.
    ?         Question mark           The preceding is repeated zero or one time.
    \         Backward slash          The following is not to be treated as a metacharacter.
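    To see a few of the characters from Table 1-1 at work, here is a small Java sketch. The class and helper names are hypothetical, and the sample strings are invented for illustration. It assumes Pattern.matches for whole-string checks and Matcher.find for the ^ and $ anchors.

```java
import java.util.regex.Pattern;

// Hypothetical demo class exercising some Table 1-1 metacharacters.
class MetacharacterDemo {

    // True only when the whole candidate matches the regex.
    static boolean fullyMatches(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    // True when the regex matches somewhere inside the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(fullyMatches("c.t", "cat"));      // . is any one character: true
        System.out.println(fullyMatches("cat|dog", "dog"));  // | means OR: true
        System.out.println(fullyMatches("[bc]at", "rat"));   // r is not in the class: false
        System.out.println(fullyMatches("ab*", "a"));        // zero or more b: true
        System.out.println(fullyMatches("ab+", "a"));        // one or more b: false
        System.out.println(fullyMatches("ab?", "abb"));      // at most one b: false
        System.out.println(fullyMatches("a{2}", "aa"));      // exactly two a: true
        System.out.println(fullyMatches("(ab)+", "abab"));   // group repeated: true
        System.out.println(found("^hello", "hello world"));  // at the beginning: true
        System.out.println(found("world$", "hello world"));  // at the end: true
        System.out.println(found("^world", "hello world"));  // not at the beginning: false
    }
}
```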

    If you’re reading a character in a regex pattern and it isn’t one of the characters listed in Table 1-1, then the character you’re reading probably stands for itself. For example, Table 1-2 shows how the pattern hello* should be read.

    Table 1-2. The Pattern hello*

    Letter   Description
    h        The character h
    e        Followed by the character e
    l        Followed by the character l
    l        Followed by the character l
    o        Followed by the character o
    *        Followed by a metacharacter that, in this case, means o should be repeated zero or more times

    In English: Look for the word hell, followed by any number of trailing o characters.
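    A quick way to check this reading is to run hello* against a few candidates. The following Java sketch uses a hypothetical class name and made-up sample strings, and assumes whole-string matching through Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for reading the pattern hello*.
class HelloStarDemo {

    // The * applies only to the o that precedes it, not to the whole word.
    static final String HELLO_STAR = "hello*";

    static boolean matchesHelloStar(String candidate) {
        return Pattern.matches(HELLO_STAR, candidate);
    }

    public static void main(String[] args) {
        System.out.println(matchesHelloStar("hell"));        // zero o characters: true
        System.out.println(matchesHelloStar("hello"));       // one o: true
        System.out.println(matchesHelloStar("hellooooo"));   // many o characters: true
        System.out.println(matchesHelloStar("helo"));        // missing an l: false
        System.out.println(matchesHelloStar("hellohello"));  // the word is not repeated: false
    }
}
```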

    If you actually need to find one of these characters, such as the * character, simply prefix the character you’re searching for with a \ character. For example, to find the * character, use \*.
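    In Java source code the backslash must itself be doubled inside a string literal, so the regex \* is written "\\*". The sketch below shows this. The class name and sample strings are hypothetical, and Pattern.quote is shown only as an alternative way to treat an entire string as literal text; the passage above does not cover it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for matching metacharacters literally.
class EscapeDemo {

    // True only when the whole candidate matches the regex.
    static boolean fullyMatches(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        // The regex \* written as a Java string literal.
        System.out.println(fullyMatches("\\*", "*"));        // true
        System.out.println(fullyMatches("2\\*3", "2*3"));    // true

        // Unescaped, * means "zero or more of the preceding 2".
        System.out.println(fullyMatches("2*3", "2*3"));      // false
        System.out.println(fullyMatches("2*3", "223"));      // true

        // Pattern.quote turns a whole string into a literal pattern.
        String literal = Pattern.quote("1+1=2");
        System.out.println(fullyMatches(literal, "1+1=2"));  // true
        System.out.println(fullyMatches("1+1=2", "1+1=2"));  // false
    }
}
```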


    This article is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070). Check it out at your favorite bookstore.

    © 2003-2008 by Developer Shed. All rights reserved.