Advanced Regex


Have you reached the point in your studies of J2SE where you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of concepts and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 07, 2005
TABLE OF CONTENTS:
  1. Advanced Regex
  2. Noncapturing Subgroups
  3. Greedy Qualifiers
  4. Reluctant Qualifiers
  5. Understanding Lookarounds
  6. Zen and the Art of Efficient Expressions
  7. Summary

Advanced Regex - Noncapturing Subgroups
(Page 2 of 7)

There may be times when you need to define a group, but you don’t want that group to be captured—you simply want to treat it like a single logical entity. The major advantage of using these noncapturing groups is that they’re less memory intensive because they don’t require the regex engine to keep track of the matching parts.

Consider the pattern (\w)(\d\d)(\w+). If you don't need access to the text matched by the trailing (\w+), you can optimize a bit by making that group noncapturing.

To mark a group as noncapturing, you simply follow the opening parenthesis of that group with the characters ?:. That is, you can write the expression as (\w)(\d\d)(?:\w+). Notice that the only difference between the original expression, (\w)(\d\d)(\w+), and the new expression, (\w)(\d\d)(?:\w+), is the ?: placed just inside the opening parenthesis of the last group.

The most common use of noncapturing groups is for the sake of logical separation. For example, say you need to find out what kind of morning a person is having. You’ll accept good morning, bad morning, terrible morning, great morning, and so on. For the sake of clarity, you write the expression as (good|bad|terrible|great) morning. That is, you want to treat the various kinds of mornings as a single logical unit.

However, say you don’t need to capture the type of morning, because you’re not going to be using it for anything—you just want to know it’s there. You modify your expression to (?:good|bad|terrible|great) morning. Specifically, you insert ?: just inside the group definition, after the opening parenthesis of the group. This gives you the ability to treat the various kinds of mornings as a single logical unit, but it doesn’t waste memory capturing the description.

NOTE  To make a group noncapturing, insert ?: inside the opening parenthesis of the group.
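The difference between the two morning patterns can be checked directly. The following is a minimal sketch, not part of the original listings; the class name MorningGroups and its helper methods are hypothetical names chosen for illustration. It compares the number of capturing groups each pattern reports and confirms that both still match the same text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MorningGroups {

    // Returns the number of capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns true if the pattern is found anywhere in the candidate.
    static boolean found(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find();
    }

    public static void main(String[] args) {
        String capturing = "(good|bad|terrible|great) morning";
        String nonCapturing = "(?:good|bad|terrible|great) morning";
        String candidate = "I am having a great morning";

        System.out.println(groupCount(capturing));          // prints 1
        System.out.println(groupCount(nonCapturing));       // prints 0
        System.out.println(found(capturing, candidate));    // prints true
        System.out.println(found(nonCapturing, candidate)); // prints true
    }
}
```

Both patterns accept the same candidates; only the bookkeeping differs.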

An added issue in working with noncapturing groups is that they aren't counted, as far as group indexing is concerned. This makes perfect sense, as you are, in effect, telling the regex engine that you aren't interested in these groups. Thus, why should the regex engine track them or provide a mechanism that allows you to refer to them?

So for the pattern (?:\w)(\d), group(0) is the entire match, namely the text matched by \w\d, and group(1) is the text matched by (\d). Notice that (?:\w) is not group(1), as it normally would be, because (?:\w) is a noncapturing group; its opening parenthesis is followed by ?:. Listing 3-3 demonstrates the use of a simple noncapturing subgroup.

Listing 3-3. Working with Noncapturing Subgroups

import java.util.regex.*;

public class NonCapturingGroupExample {

    public static void main(String[] args) {
        // define the pattern; its only group is noncapturing
        String regex = "hello|hi|greetings|(?:good morning)";

        // define the candidate strings
        String candidate1 = "Tommy say hi to you";
        String candidate2 = "Tommy say good morning to you";

        // compile the pattern
        Pattern pattern = Pattern.compile(regex);

        // extract a matcher for the first candidate string
        Matcher matcher = pattern.matcher(candidate1);

        // show the number of capturing groups
        System.out.println("GROUP COUNT:" + matcher.groupCount());
        if (matcher.find()) {
            System.out.println("GOT 1:" + candidate1);
        }

        // reuse the matcher, and check the second candidate string
        matcher.reset(candidate2);

        // show the number of capturing groups
        System.out.println("GROUP COUNT:" + matcher.groupCount());
        if (matcher.find()) {
            System.out.println("GOT 2:" + candidate2);
        }
    }
}

The output of this example is shown in Output 3-1.

Output 3-1. Output of NonCapturingGroupExample

GROUP COUNT:0
GOT 1:Tommy say hi to you
GROUP COUNT:0
GOT 2:Tommy say good morning to you

If you had used a capturing group instead, the group count would have been 1. Although this may seem like a fairly innocuous issue, keeping track of group indexes becomes considerably harder as the number of capturing groups grows.
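To see how the ?: marker shifts group indexes, the sketch below applies (?:\w)(\d) and its fully capturing counterpart (\w)(\d) to the same input. The class name GroupIndexDemo, the helper groupOfFirstMatch, and the sample input A1 are assumptions made for illustration; they don't come from the original listings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexDemo {

    // Returns group(index) of the first match, or null if the pattern
    // isn't found (or if that group took no part in the match).
    static String groupOfFirstMatch(String regex, String candidate, int index) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find() ? matcher.group(index) : null;
    }

    public static void main(String[] args) {
        // With a noncapturing first group, the digit is group 1.
        System.out.println(groupOfFirstMatch("(?:\\w)(\\d)", "A1", 0)); // prints A1
        System.out.println(groupOfFirstMatch("(?:\\w)(\\d)", "A1", 1)); // prints 1

        // With two capturing groups, the digit moves to group 2.
        System.out.println(groupOfFirstMatch("(\\w)(\\d)", "A1", 1));   // prints A
        System.out.println(groupOfFirstMatch("(\\w)(\\d)", "A1", 2));   // prints 1

        // A capturing variant of the Listing 3-3 pattern reports one group.
        Matcher matcher =
            Pattern.compile("hello|hi|greetings|(good morning)").matcher("");
        System.out.println(matcher.groupCount());                       // prints 1
    }
}
```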

Back References

Back references are the mechanism you use to access captured subgroups while the regex engine is executing; you can think of this as the regex engine's runtime. Thus, a later part of the pattern can refer to whatever a subgroup matched in an earlier part of the same pattern.

For example, in Chapter 1, I discussed the pattern \b(\w+) \1\b to match repeated words. Here, when you use the \1, you're asking the regex engine to refer back to itself and insert whatever had matched the (\w+) part of it. Why the (\w+) part? Because that is the capturing group with the index of 1. Remember that capturing groups are counted by their opening parentheses, from left to right, starting with the index of 1.
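The repeated-word pattern can be tried out with a few lines of Java. This is a minimal sketch; the class name RepeatedWordFinder, its helper method, and the sample sentences are hypothetical and chosen only to illustrate the \1 back reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWordFinder {

    // \1 must match exactly the text that group 1, (\w+), just captured.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns the first word that is immediately repeated, or null if none is.
    static String firstRepeatedWord(String text) {
        Matcher matcher = REPEATED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints is
        System.out.println(firstRepeatedWord("this is a test"));    // prints null
    }
}
```

The second call returns null even though "this is" contains the letters "is is": the leading \b prevents a match from starting in the middle of a word.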

For a given pattern with subgroups, Java offers three mechanisms for referring to the corresponding group matches. The first, and most object-oriented, mechanism is to use the various Matcher object methods. These include the Matcher.group, Matcher.start, Matcher.end, and Matcher.replaceAll methods, as discussed in Chapter 2. However, this mechanism doesn’t allow the regex pattern to refer back to itself during the regex runtime.
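As a brief sketch of this first mechanism, the example below reads each group of the pattern (\w)(\d)(\w) after a successful match against W3C, using Matcher.group, Matcher.start, and Matcher.end. The class name MatcherGroupAccess and the describe helper, along with its output format, are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherGroupAccess {

    // Formats one group of a successful match as text@start-end.
    static String describe(Matcher matcher, int index) {
        return matcher.group(index) + "@"
            + matcher.start(index) + "-" + matcher.end(index);
    }

    // Matches the regex against the candidate and describes the given group.
    static String describeGroup(String regex, String candidate, int index) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.find()) {
            return null;
        }
        return describe(matcher, index);
    }

    public static void main(String[] args) {
        String regex = "(\\w)(\\d)(\\w)";
        System.out.println(describeGroup(regex, "W3C", 0)); // prints W3C@0-3
        System.out.println(describeGroup(regex, "W3C", 1)); // prints W@0-1
        System.out.println(describeGroup(regex, "W3C", 2)); // prints 3@1-2
        System.out.println(describeGroup(regex, "W3C", 3)); // prints C@2-3
    }
}
```

These calls happen after the match has finished, which is why they can't influence what the pattern itself matches.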

The second approach uses the \n nomenclature, where \n refers to the nth capturing group, if it exists. This deserves a bit of explanation. Specifically, what does “if it exists” mean? For an answer, consider the simple regex pattern (\w)(\d)(\w) as applied to W3C. You could refer to group 1 by using \1, group 2 by using \2, and group 3 by using \3. (Group 0, the entire match, has no back reference of its own; in Java, \0 begins an octal escape.) However, \5 would be meaningless because there’s no group 5 in the pattern, as would a reference to \33.

Well, yes and no regarding group 33. Although the regex engine knows that there is no group 5, the latter example, group 33, is open to interpretation. That is, the regex engine could, and does, decide that you meant group 3, followed by the character 3. Thus, examining the back reference \33 from the preceding example would yield C3: C followed by the character 3. As shown previously, this mechanism does allow the regex pattern to refer back to itself during runtime.
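The \33 interpretation can be observed by appending it to the pattern and testing a few inputs. The following sketch assumes the standard java.util.regex engine; the class name AmbiguousBackReference, the helper matchesWhole, and the candidate strings are hypothetical.

```java
import java.util.regex.Pattern;

public class AmbiguousBackReference {

    // Three capturing groups, then \33. Because there is no group 33, the
    // engine reads this as back reference \3 followed by a literal 3.
    static final String REGEX = "(\\w)(\\d)(\\w)\\33";

    // Returns true if the whole candidate matches REGEX.
    static boolean matchesWhole(String candidate) {
        return Pattern.matches(REGEX, candidate);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("W3CC3")); // prints true
        System.out.println(matchesWhole("W3CC"));  // prints false
        System.out.println(matchesWhole("W3CX3")); // prints false
    }
}
```

In W3CC3, group 3 captures C, so \3 demands a second C and the trailing literal 3 completes the match.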

Finally, there is a third way to refer to back references. Three replacement methods on the Matcher object, appendReplacement, replaceAll, and replaceFirst, as well as the String methods replaceFirst and replaceAll, also allow access to the captured groups by using the $n nomenclature, in which n represents the index of the group in question. Like the \n nomenclature discussed earlier in this section, use of $n will prompt the regex engine to take the most liberal interpretation of the reference possible in order to resolve it to an existing group.

Thus for the pattern (\w)(\d)(\w), using $33 will prompt the regex engine to assume you meant group 3 followed by the character 3. This is demonstrated in Listing 3-4.

Listing 3-4. Working with Back References

import java.util.regex.*;

public class ReplaceExample {

    public static void main(String[] args) {
        // define the pattern
        String regex = "(\\w)(\\d)(\\w+)";

        // compile the pattern
        Pattern pattern = Pattern.compile(regex);

        // define the candidate string
        String candidate = "X99SuperJava";

        // extract a matcher for the candidate string
        Matcher matcher = pattern.matcher(candidate);

        // return a new string that has replaced
        // every matching part of the candidate string
        // with whatever was found in the third group,
        // followed by the digit three
        String tmp = matcher.replaceAll("$33");

        // prints REPLACEMENT: 9SuperJava3
        System.out.println("REPLACEMENT: " + tmp);

        // notice that the original candidate string
        // is unchanged, as expected. After all, Strings
        // are immutable objects in Java.
        // prints ORIGINAL: X99SuperJava
        System.out.println("ORIGINAL: " + candidate);
    }
}
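The same $n nomenclature is available through the String methods replaceAll and replaceFirst. The sketch below is illustrative only; the class name DollarReferenceDemo, its two helper methods, and the sample strings are hypothetical.

```java
public class DollarReferenceDemo {

    // Swaps every pair of space-separated words by referring to the
    // two captured groups in reverse order.
    static String swapWordPairs(String text) {
        return text.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    // Rewrites the first match as its three groups separated by hyphens.
    static String hyphenateGroups(String text) {
        return text.replaceFirst("(\\w)(\\d)(\\w+)", "$1-$2-$3");
    }

    public static void main(String[] args) {
        System.out.println(swapWordPairs("hello world"));    // prints world hello
        System.out.println(hyphenateGroups("X99SuperJava")); // prints X-9-9SuperJava
    }
}
```

In the second call, (\w) captures X, (\d) captures the first 9, and (\w+) captures the remaining 9SuperJava.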

It’s important to be careful when you’re working with back references. You could be asking the regex engine to do things you had no idea you were asking for, and that, in turn, could cost you in terms of efficiency and/or correctness.

One final word of warning: Referring to a group that doesn’t exist, whether through a Matcher method such as group(n) or through $n in a replacement string, will cause an IndexOutOfBoundsException to be thrown. Make sure your groups exist before you refer to them.
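The warning above can be reproduced with a short sketch that asks for group 5 of a three-group pattern. The class name MissingGroupDemo and its helper methods are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MissingGroupDemo {

    static final Pattern PATTERN = Pattern.compile("(\\w)(\\d)(\\w)");

    // Returns true if applying the replacement string to a matching
    // candidate throws IndexOutOfBoundsException.
    static boolean replacementThrows(String candidate, String replacement) {
        try {
            PATTERN.matcher(candidate).replaceAll(replacement);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    // Returns true if Matcher.group(index) throws IndexOutOfBoundsException.
    static boolean groupThrows(String candidate, int index) {
        Matcher matcher = PATTERN.matcher(candidate);
        if (!matcher.find()) {
            return false;
        }
        try {
            matcher.group(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(replacementThrows("W3C", "$3")); // prints false
        System.out.println(replacementThrows("W3C", "$5")); // prints true
        System.out.println(groupThrows("W3C", 3));          // prints false
        System.out.println(groupThrows("W3C", 5));          // prints true
    }
}
```

One defensive option is to compare the index against Matcher.groupCount() before using it.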


© 2003-2017 by Developer Shed. All rights reserved.