Advanced Regex
By: Apress Publishing
    2005-07-07

    Table of Contents:
  • Advanced Regex
  • Noncapturing Subgroups
  • Greedy Qualifiers
  • Reluctant Qualifiers
  • Understanding Lookarounds
  • Zen and the Art of Efficient Expressions
  • Summary



    Advanced Regex - Noncapturing Subgroups


    (Page 2 of 7 )

    There may be times when you need to define a group, but you don’t want that group to be captured—you simply want to treat it like a single logical entity. The major advantage of using these noncapturing groups is that they’re less memory intensive because they don’t require the regex engine to keep track of the matching parts.

    Consider the pattern (\w)(\d\d)(\w+). If you don't need access to the text matched by the trailing (\w+), you can optimize a bit by making that group noncapturing.

    To mark a group as noncapturing, you simply follow the opening parenthesis of that group with the characters ?:. That is, you can write the expression as (\w)(\d\d)(?:\w+). Notice that the only difference between the original expression, (\w)(\d\d)(\w+), and the new expression, (\w)(\d\d)(?:\w+), is the ?: that immediately follows the opening parenthesis of the last group, (\w+).
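
    As a minimal sketch of the difference, the following compares the number of capturing groups that the two expressions define. The class name GroupCountComparison and the helper countGroups are illustrative names, not part of the original listings.

```java
import java.util.regex.Pattern;

public class GroupCountComparison {

    // Returns how many capturing groups the given pattern defines.
    // groupCount() depends only on the pattern, so an empty input is enough.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // all three groups are captured
        System.out.println(countGroups("(\\w)(\\d\\d)(\\w+)"));   // prints 3
        // the trailing group is noncapturing, so only two are counted
        System.out.println(countGroups("(\\w)(\\d\\d)(?:\\w+)")); // prints 2
    }
}
```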

    The most common use of noncapturing groups is for the sake of logical separation. For example, say you need to find out what kind of morning a person is having. You’ll accept good morning, bad morning, terrible morning, great morning, and so on. For the sake of clarity, you write the expression as (good|bad|terrible|great) morning. That is, you want to treat the various kinds of mornings as a single logical unit.

    However, say you don’t need to capture the type of morning, because you’re not going to be using it for anything—you just want to know it’s there. You modify your expression to (?:good|bad|terrible|great) morning. Specifically, you insert ?: just inside the group definition, after the opening parenthesis of the group. This gives you the ability to treat the various kinds of mornings as a single logical unit, but it doesn’t waste memory capturing the description.

    NOTE  To make a group noncapturing, insert ?: inside the opening parenthesis of the group.
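
    The morning example might look like the following sketch. The class name MorningExample, the helper mentionsMorning, and the sample sentences are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MorningExample {

    // The alternatives are grouped for logical separation only; nothing is captured.
    static final Pattern MORNING =
        Pattern.compile("(?:good|bad|terrible|great) morning");

    // Returns true if the text mentions one of the accepted kinds of morning.
    static boolean mentionsMorning(String text) {
        Matcher matcher = MORNING.matcher(text);
        return matcher.find();
    }

    public static void main(String[] args) {
        System.out.println(mentionsMorning("What a terrible morning this is")); // prints true
        System.out.println(mentionsMorning("Have a nice evening"));             // prints false
        // the noncapturing group is not counted
        System.out.println(MORNING.matcher("").groupCount());                   // prints 0
    }
}
```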

    An added consideration in working with noncapturing groups is that they aren’t counted, as far as group indexing is concerned. This makes perfect sense, as you are, in effect, telling the regex engine that you aren’t interested in these groups. Thus, why should the regex engine track them or provide a mechanism that allows you to refer to them? After all, you explicitly told the regex engine that you weren’t interested in doing so.

    So for the pattern (?:\w)(\d), group(0) is the entire match, namely the text matched by \w\d, and group(1) is the text matched by (\d). Notice that (?:\w) is not group(1), as it normally would be, because (?:\w) is a noncapturing group; its opening parenthesis is followed by ?:. Listing 3-3 demonstrates the use of a simple noncapturing subgroup.

    Listing 3-3. Working with Noncapturing Subgroups

    import java.util.regex.*;

    public class NonCapturingGroupExample {
        public static void main(String[] args) {
            // define the pattern
            String regex = "hello|hi|greetings|(?:good morning)";
            // define the candidate strings
            String candidate1 = "Tommy say hi to you";
            String candidate2 = "Tommy say good morning to you";
            // compile the pattern
            Pattern pattern = Pattern.compile(regex);
            // create a matcher for the first candidate string
            Matcher matcher = pattern.matcher(candidate1);
            // show the number of capturing groups
            System.out.println("GROUP COUNT:" + matcher.groupCount());
            if (matcher.find()) {
                System.out.println("GOT 1:" + candidate1);
            }
            // reuse the matcher by resetting it with the second candidate string
            matcher.reset(candidate2);
            // show the number of capturing groups
            System.out.println("GROUP COUNT:" + matcher.groupCount());
            if (matcher.find()) {
                System.out.println("GOT 2:" + candidate2);
            }
        }
    }

    The output of this example is shown in Output 3-1.

    Output 3-1. Output of NonCapturingGroupExample

    GROUP COUNT:0
    GOT 1:Tommy say hi to you
    GROUP COUNT:0
    GOT 2:Tommy say good morning to you

    If you had used a capturing group, the group count would have been 1. Although this may seem like a fairly innocuous issue, keeping track of group indexes becomes harder as the number of capturing groups grows.
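
    To see how the indexing shifts, the following sketch applies a capturing and a noncapturing variant of the same pattern to one input and asks for group 1 in each case. The class name GroupIndexShift, the helper firstGroup, and the input a1 are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexShift {

    // Returns the text captured by group 1, or null if the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // both groups capture, so group 1 is the word character
        System.out.println(firstGroup("(\\w)(\\d)", "a1"));   // prints a
        // the first group is noncapturing, so group 1 is now the digit
        System.out.println(firstGroup("(?:\\w)(\\d)", "a1")); // prints 1
    }
}
```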

    Back References

    Back references are the mechanism you use to access captured subgroups while the regex engine is executing; you can think of this as the regex engine’s runtime. Thus, you can refer to a subgroup from an earlier part of the match later on in the pattern.

    For example, in Chapter 1, I discussed the pattern \b(\w+) \1\b to match repeated words. Here, when you use the \1, you’re asking the regex engine to refer back to itself and insert whatever had matched the (\w+) part of the pattern. Why the (\w+) part? Because that is the capturing group with the index of 1. Remember that capturing groups are numbered by their opening parentheses, counting from left to right and starting with the index of 1.
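
    A minimal sketch of the repeated-word pattern in use follows. The class name RepeatedWordFinder, the helper findRepeats, and the sample sentence are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWordFinder {

    // \1 must match exactly the same text that group 1 captured
    static final Pattern REPEATED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns each word that appears twice in a row, separated by one space.
    static List<String> findRepeats(String text) {
        List<String> repeats = new ArrayList<>();
        Matcher matcher = REPEATED.matcher(text);
        while (matcher.find()) {
            repeats.add(matcher.group(1));
        }
        return repeats;
    }

    public static void main(String[] args) {
        // prints [the, is]
        System.out.println(findRepeats("Paris in the the spring is is lovely"));
    }
}
```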

    For a given pattern with subgroups, Java offers three mechanisms for referring to the corresponding group matches. The first, and most object-oriented, mechanism is to use the various Matcher object methods. These include the Matcher.group, Matcher.start, Matcher.end, and Matcher.replaceAll methods, as discussed in Chapter 2. However, this mechanism doesn’t allow the regex pattern to refer back to itself during the regex runtime.
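
    As a rough illustration of this first mechanism, the sketch below reads a group's text and position after a match by using Matcher.group, Matcher.start, and Matcher.end. The class name MatcherGroupAccess, the helper describeGroup, and the output format are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherGroupAccess {

    // Returns "text@start-end" for the requested group of the first match,
    // or "no match" if the pattern is not found in the input.
    static String describeGroup(String regex, String input, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.group(group) + "@" + matcher.start(group) + "-" + matcher.end(group);
    }

    public static void main(String[] args) {
        String regex = "(\\w)(\\d)(\\w)";
        System.out.println(describeGroup(regex, "W3C", 0)); // prints W3C@0-3
        System.out.println(describeGroup(regex, "W3C", 2)); // prints 3@1-2
    }
}
```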

    The second approach uses the \n nomenclature, where \n refers to the nth capturing group, if it exists. This deserves a bit of explanation. Specifically, what does “if it exists” mean? For an answer, consider the simple regex pattern (\w)(\d)(\w) as applied to W3C. You could refer to group 1 by using \1, group 2 by using \2, and group 3 by using \3. Note that a Java pattern has no \0 back reference, because \0 begins an octal escape; to get the entire match, use Matcher.group(0) instead. However, \5 would be meaningless because there’s no group 5 in the pattern, as would a reference to \33.

    Well, yes and no regarding group 33. Although the regex engine knows that there is no group 5, the latter example, group 33, is open to interpretation. That is, the regex engine could, and does, decide that you meant group 3, followed by the character 3. Thus, examining the back reference \33 from the preceding example would yield C3: C followed by the character 3. As shown previously, this mechanism does allow the regex pattern to refer back to itself during runtime.
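
    The following sketch shows this interpretation in a pattern. It appends \33 to (\w)(\d)(\w), so the whole expression should match W3C followed by C3. The class name AmbiguousBackReference and the helper matches are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class AmbiguousBackReference {

    // With only three groups defined, \33 is read as group 3 followed by a literal 3
    static final String REGEX = "(\\w)(\\d)(\\w)\\33";

    // Returns true if the entire input matches the pattern.
    static boolean matches(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("W3CC3")); // prints true
        System.out.println(matches("W3CW3")); // prints false: group 3 captured C, not W
    }
}
```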

    Finally, there is a third way to refer to back references. Three replacement methods on the Matcher object, appendReplacement, replaceAll, and replaceFirst, as well as the String methods replaceFirst and replaceAll, also allow access to the captured back references by using the $n nomenclature, in which n represents the index of the group in question. Like the \n nomenclature discussed earlier in this section, use of $n will prompt the regex engine to take the most liberal interpretation of the reference possible in order to find a group that exists.

    Thus for the pattern (\w)(\d)(\w), using $33 will prompt the regex engine to assume you meant group 3 followed by the character 3. This is demonstrated in Listing 3-4.

    Listing 3-4. Working with Back References

    import java.util.regex.*;

    public class ReplaceExample {
        public static void main(String[] args) {
            // define the pattern
            String regex = "(\\w)(\\d)(\\w+)";
            // compile the pattern
            Pattern pattern = Pattern.compile(regex);
            // define the candidate string
            String candidate = "W3C";
            // create a matcher for the candidate string
            Matcher matcher = pattern.matcher(candidate);
            // return a new string that has replaced
            // every matching part of the candidate string
            // with whatever was found in the third group,
            // followed by the digit three
            String tmp = matcher.replaceAll("$33");
            // prints REPLACEMENT: C3
            System.out.println("REPLACEMENT: " + tmp);
            // notice that the original candidate string
            // is unchanged, as expected. After all, Strings
            // are immutable objects in Java.
            // prints ORIGINAL: W3C
            System.out.println("ORIGINAL: " + candidate);
        }
    }

    It’s important to be careful when you’re working with back references. You could be asking the regex engine to do things you had no idea you were asking for, and that, in turn, could cost you in terms of efficiency and/or correctness.

    One final word of warning: Referring to a group that doesn’t exist, either through a Matcher method such as group(n) or through $n in a replacement string, will cause an IndexOutOfBoundsException to be thrown. Make sure your groups exist before you refer to them.
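
    One way to guard against this is sketched below: it catches the exception raised when a replacement string names a group that the pattern doesn’t define. The class name MissingGroupReference, the helper safeReplace, and the ERROR prefix are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingGroupReference {

    // Performs replaceAll, but reports a reference to a nonexistent group
    // as an "ERROR: ..." string instead of letting the exception escape.
    static String safeReplace(String regex, String input, String replacement) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        try {
            return matcher.replaceAll(replacement);
        } catch (IndexOutOfBoundsException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String regex = "(\\w)(\\d)(\\w)";
        // group 3 exists, so the match is replaced by its text
        System.out.println(safeReplace(regex, "W3C", "$3")); // prints C
        // group 5 does not exist, so the replacement fails
        System.out.println(safeReplace(regex, "W3C", "$5").startsWith("ERROR")); // prints true
    }
}
```

    Comparing the index against Matcher.groupCount() before building the replacement string is an alternative that avoids the exception altogether.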


    This article is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
