Advanced Regex
By: Apress Publishing
    2005-07-07

    Table of Contents:
  • Advanced Regex
  • Noncapturing Subgroups
  • Greedy Qualifiers
  • Reluctant Qualifiers
  • Understanding Lookarounds
  • Zen and the Art of Efficient Expressions
  • Summary


    Have you reached the point in your studies of J2SE where you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of those concepts and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

    This chapter explores some of the more advanced features of regular expressions in J2SE. The goal is to provide a point of reference for the more complex regex tools and concepts available to Java developers. This chapter should be a resource you can come back to when you need a refresher on a J2SE regex concept.

    Of course, there’s no learning tool as useful as actually writing code, so I encourage you to try out these concepts on your own. This chapter introduces a variety of concepts, including groups, subgroups, noncapturing groups, greedy qualifiers, possessive qualifiers, reluctant qualifiers, positive lookaheads, negative lookaheads, positive lookbehinds, and negative lookbehinds. The final section of this chapter focuses on increasing the efficiency of your regular expressions.

    NOTE: The examples in this chapter are intentionally simple so as to clearly illustrate the mechanisms being discussed. More complex examples, such as those used professionally, are explored in Chapter 5 and in the Appendixes.

    Understanding Groups

    As I explained in Chapter 2, a group is simply a sequence of characters that describes a regex pattern. Thus, \w\d is a group, because it consists of two elements, \w and \d, in sequence. This is an implicit group, and a trivial one; most groups are marked explicitly by surrounding them with parentheses. In Java regular expressions, every pattern has at least one group, and groups are indexed. Thus, the pattern \w\d has a single group, namely itself, and the index of that group is 0.

    Groups are described in the Pattern but realized in the Matcher. Conceptually, this is similar to how SQL works: the query is described in the SQL statement, but the matching rows are extracted from the ResultSet. Thus, when you describe the pattern \w\d, you might extract the matching region A9 from the candidate A9 is my favorite. For example, if the group is described as

    Pattern p = Pattern.compile("\\w\\d");

    and the candidate String is

    String candidate = "A9 is my favorite";

    you define a Matcher for this candidate String:

    Matcher matcher = p.matcher(candidate);

    Assuming that the Matcher.find() method has already been called, then calling Matcher.group(0) returns the part of the candidate String that matches the entire pattern, as follows:

    String tmp = matcher.group(0);

    Thus, the Matcher.group(0) method is a bit of a misnomer. It doesn’t actually extract the group; it extracts the part of the candidate String that matches that group. This is a subtle but important difference. The full example follows in Listing 3-1.

    Listing 3-1. Working with Groups

    import java.util.regex.*;

    public class SimpleGroupExample {

        public static void main(String[] args) {
            // the original pattern is always group 0
            Pattern p = Pattern.compile("\\w\\d");
            String candidate = "A9 is my favorite";

            // if there is a match, extract that part of
            // the candidate string that matches group(0)
            Matcher matcher = p.matcher(candidate);

            // OUTPUT is 'A9'
            if (matcher.find()) {
                String tmp = matcher.group(0);
                System.out.println(tmp);
            }
        }
    }
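
    As a small supplement to Listing 3-1, the following sketch illustrates two details the listing relies on: a group can only be read after a successful match attempt, and group() with no argument is shorthand for group(0). The class name GroupZeroSketch and its helper methods are illustrative and do not come from the chapter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupZeroSketch {

    // Hypothetical helper: returns the first region of the candidate
    // that matches \w\d, or null if there is no such region.
    static String firstMatch(String candidate) {
        Matcher matcher = Pattern.compile("\\w\\d").matcher(candidate);
        return matcher.find() ? matcher.group(0) : null;
    }

    // Hypothetical helper: returns true if asking for group(0) before
    // any match attempt is rejected with an IllegalStateException.
    static boolean groupBeforeFindFails(String candidate) {
        Matcher matcher = Pattern.compile("\\w\\d").matcher(candidate);
        try {
            matcher.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String candidate = "A9 is my favorite";
        System.out.println(firstMatch(candidate));           // A9
        System.out.println(groupBeforeFindFails(candidate)); // true

        Matcher matcher = Pattern.compile("\\w\\d").matcher(candidate);
        if (matcher.find()) {
            // group() is shorthand for group(0)
            System.out.println(matcher.group().equals(matcher.group(0))); // true
        }
    }
}
```

    This is why the listing guards the call to group(0) with matcher.find(): the Matcher has nothing to hand back until a match has actually been found.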

    That works perfectly when you need the whole pattern. But what about cases in which you need subsections of that pattern? How do you extract those? That’s where subgroups come to the rescue.

    Understanding Subgroups

    Just as a pattern can have groups, so can it have subgroups. Subgroups are simply smaller groups within the larger whole. They’re separated from the original group, and from each other, by being surrounded by parentheses.

    In the example in the preceding section, in order to be able to refer explicitly to the digit \d, you modify the pattern to \w(\d). Here, \w\d is group(0) and (\d) is group(1). Listing 3-2 demonstrates the use of a subgroup to extract the part of the candidate that matches the digit \d.

    Listing 3-2. Working with Subgroups

    import java.util.regex.*;

    public class SimpleSubGroupExample {

        public static void main(String[] args) {
            // the original pattern is always group 0
            Pattern p = Pattern.compile("\\w(\\d)");
            String candidate = "A9 is my favorite";

            // if there is a match, extract the parts that match
            Matcher matcher = p.matcher(candidate);
            if (matcher.find()) {

                // extract 'A9', which matches group(0), which is
                // always the entire pattern itself
                String tmp = matcher.group(0);
                System.out.println(tmp); // tmp is A9

                // extract the part of the candidate string that matches
                // group(1): namely, the '9' that follows the 'A'
                tmp = matcher.group(1); // tmp is 9
                System.out.println(tmp);
            }
        }
    }

    Listing 3-2 allows you to extract the part of the candidate String that matches the entire expression. It also allows you to extract subsections of that matching region. Thus, you can extract 9 from the matching region A9 because 9 matches group(1). That is, the regex engine stores the 9 when it matches A9, because you defined the subgroup (\d).
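
    The same subgroup can be read on every successive match, not just the first. The sketch below, which assumes a made-up candidate string and an illustrative class name (AllSubgroupsSketch), calls find() in a loop and collects group(1) each time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllSubgroupsSketch {

    // Hypothetical helper: collects the digit captured by group(1)
    // for every region of the candidate that matches \w(\d).
    static List<String> digitsAfterWordChar(String candidate) {
        Matcher matcher = Pattern.compile("\\w(\\d)").matcher(candidate);
        List<String> digits = new ArrayList<>();
        while (matcher.find()) {
            // group(1) is refreshed on each successful find()
            digits.add(matcher.group(1));
        }
        return digits;
    }

    public static void main(String[] args) {
        // The candidate string is an assumption made for this example
        System.out.println(digitsAfterWordChar("A9, B7 and C3")); // [9, 7, 3]
    }
}
```

    Each call to find() replaces the previously stored subgroup values, so anything you need from one match should be copied out before looking for the next.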

    Accessing Subgroups

    So how does all of this actually work? Are there little fairies running around under the hood of the regex engine, keeping track of all these groups? Well, yes and no.

    Although there are no fairies under the hood that I know of, the regex engine internally tracks all subgroups by storing the matching sections of the candidate String in memory. Thus, because you defined the pattern as \w(\d), the regex engine will keep track of any single digit that is preceded by an alphanumeric or underscore character. That’s what the engine thinks you mean for it to do when you put the expression \d in parentheses.

    The engine provides access to these captured groups based on their numeric index. Captured groups are indexed from left to right, in the order of their opening parentheses, and group(0) always refers to the original expression in its entirety. Thus, in the preceding example, group(0) refers to the part of the candidate string that matches the entire expression \w(\d), whereas group(1) refers to the part of the candidate string that matches the (\d) part of the expression.
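
    The "order of their opening parentheses" rule matters most when groups are nested. The following sketch uses a pattern that is not from the chapter, ((\w)(\d))\s(\w+), to show how the indexes fall out; the class and helper names are likewise illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupOrderSketch {

    // Hypothetical helper: returns group(0) through group(groupCount())
    // for the first match of regex in candidate, or an empty array if
    // nothing matches.
    static String[] groupsOfFirstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.find()) {
            return new String[0];
        }
        // groupCount() does not count group 0, hence the + 1
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Opening parentheses, counted left to right:
        //   1: ((\w)(\d))   2: (\w)   3: (\d)   4: (\w+)
        String[] groups =
            groupsOfFirstMatch("((\\w)(\\d))\\s(\\w+)", "A9 is my favorite");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group(" + i + ") = " + groups[i]);
        }
        // group(0) = A9 is
        // group(1) = A9
        // group(2) = A
        // group(3) = 9
        // group(4) = is
    }
}
```

    Note that the outer group gets index 1 even though it closes after groups 2 and 3; only the position of the opening parenthesis counts.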

    For example, if your pattern was (\w)(\d)(\w)(\w) and your candidate string was J2SE, then group 0 would have matched the entire candidate J2SE. Group 1 would have matched J, group 2 would have matched 2, group 3 would have matched S, and group 4 would have matched E.

    Correspondingly, if your pattern stayed (\w)(\d)(\w)(\w) but your candidate string was R2D2, then group 0 would have matched the entire candidate R2D2. Group 1 would have matched R, group 2 would have matched 2, group 3 would have matched D, and group 4 would have matched 2.
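
    The two walkthroughs above can be checked with a short program. This is a minimal sketch: the class name IndexedGroupsSketch and the describe helper are assumptions made for illustration, and matches() is used because the pattern is meant to cover the whole candidate.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexedGroupsSketch {

    private static final Pattern PATTERN =
        Pattern.compile("(\\w)(\\d)(\\w)(\\w)");

    // Hypothetical helper: lists every group as index=value, or returns
    // "no match" if the whole candidate does not match the pattern.
    static String describe(String candidate) {
        Matcher matcher = PATTERN.matcher(candidate);
        if (!matcher.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= matcher.groupCount(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(i).append('=').append(matcher.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("J2SE")); // 0=J2SE 1=J 2=2 3=S 4=E
        System.out.println(describe("R2D2")); // 0=R2D2 1=R 2=2 3=D 4=2
        System.out.println(describe("Java")); // no match
    }
}
```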

