
Advanced Regex

Have you reached the point in your studies of J2SE where you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of those concepts and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 07, 2005

THIS CHAPTER EXPLORES some of the more advanced features of regular expressions in J2SE. The goal is to provide a point of reference for the more complex regex tools and concepts available to Java developers. This chapter should be a resource you can come back to when you need a refresher on a J2SE regex concept.

Of course, there’s no learning tool as useful as actually writing code, so I encourage you to try out these concepts on your own. This chapter introduces a variety of concepts, including groups, subgroups, noncapturing groups, greedy qualifiers, possessive qualifiers, reluctant qualifiers, positive lookaheads, negative lookaheads, positive lookbehinds, and negative lookbehinds. The final section of this chapter focuses on increasing the efficiency of your regular expressions.

NOTE The examples in this chapter are intentionally simple so as to clearly illustrate the mechanisms being discussed. More complex examples, such as those used professionally, are explored in Chapter 5 and in the Appendixes.

Understanding Groups

As I explained in Chapter 2, a group is simply a sequence of characters that describes a regex pattern. Thus, \w\d is a group, because it contains two character classes, \w and \d, in a sequence. This is an implicit group and thus trivial, because most groups, as such, are explicitly surrounded by parentheses. In Java regular expressions, every pattern has groups, and they’re indexed. Thus, the pattern \w\d has a single group, namely itself, and the index of that group is, of course, 0.

Groups are described in the Pattern but realized in the Matcher. Conceptually, this is similar to how SQL works, where the query describes what you want, but the matching rows are extracted from the ResultSet. Thus, when you describe the pattern \w\d, you might extract the matching text A9 from the candidate string "A9 is my favorite". For example, if the group is described as

Pattern p = Pattern.compile("\\w\\d");

and the candidate String is

String candidate = "A9 is my favorite";

you define a Matcher for this candidate String:

Matcher matcher = p.matcher(candidate);

Assuming that the Matcher.find() method has already been called, then calling Matcher.group(0) returns the part of the candidate String that matches the entire pattern, as follows:

String tmp = matcher.group(0);

Thus, the Matcher.group(0) method is a bit of a misnomer. It doesn’t actually extract the group; it extracts the part of the candidate String that matches that group. This is a subtle but important difference. The full example follows in Listing 3-1.

Listing 3-1. Working with Groups

import java.util.regex.*;
public class SimpleGroupExample{
  public static void main(String args[]){
    //the original pattern is always group 0
    Pattern p = Pattern.compile("\\w\\d");
    String candidate = "A9 is my favorite";
    //if there is a match, extract that part of
    //the candidate string that matches group(0)
    Matcher matcher = p.matcher(candidate);
    if (matcher.find()){
      String tmp = matcher.group(0);
      System.out.println(tmp); //OUTPUT is 'A9'
    }
  }
}
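
The order of calls in Listing 3-1 matters: group(0) only has something to return once find() has succeeded. The following is a minimal sketch of that point, not part of the original listing; the class name GroupBeforeFindSketch and both helper methods are made up for illustration. It shows that calling group(0) on a fresh Matcher throws an IllegalStateException, and that a pattern without parentheses reports zero capturing groups even though group 0 is still available.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupBeforeFindSketch {
    // Hypothetical helper: calls group(0) without calling find() first
    // and reports whether the Matcher rejected the call.
    static boolean groupBeforeFindThrows(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        try {
            matcher.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Hypothetical helper: returns the text matched by the whole pattern,
    // or null if the candidate contains no match.
    static String firstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find() ? matcher.group(0) : null;
    }

    public static void main(String[] args) {
        String candidate = "A9 is my favorite";
        // No match has been attempted yet, so group(0) is rejected.
        System.out.println(groupBeforeFindThrows("\\w\\d", candidate)); // true
        // groupCount() counts only parenthesized groups; group 0 is not included.
        System.out.println(Pattern.compile("\\w\\d").matcher(candidate).groupCount()); // 0
        // After a successful find(), group(0) is the matched text.
        System.out.println(firstMatch("\\w\\d", candidate)); // A9
    }
}
```

Guarding group(0) with the result of find(), as Listing 3-1 does, avoids the exception.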

That works perfectly when you need the whole pattern. But what about cases in which you need subsections of that pattern? How do you extract those? That’s where subgroups come to the rescue.

Understanding Subgroups

Just as a pattern can have groups, so can it have subgroups. Subgroups are simply smaller groups within the larger whole. They’re separated from the original group, and from each other, by being surrounded by parentheses.

In the example in the preceding section, in order to be able to refer explicitly to the digit \d, you modify the pattern to \w(\d). Here, the entire pattern \w(\d) is group(0) and (\d) is group(1). Listing 3-2 demonstrates the use of a subgroup to extract the part of the candidate that matches the digit \d.

Listing 3-2. Working with Subgroups

import java.util.regex.*;
public class SimpleSubGroupExample{
  public static void main(String args[]){
    //the original pattern is always group 0
    Pattern p = Pattern.compile("\\w(\\d)");
    String candidate = "A9 is my favorite";
    //if there is a match, extract the parts that
    //match group(0) and group(1)
    Matcher matcher = p.matcher(candidate);
    if (matcher.find()){
      //Extract 'A9', which matches group(0), which is
      //always the entire pattern itself.
      String tmp = matcher.group(0);
      System.out.println(tmp); //tmp is A9
      //extract part of the candidate string that matches
      //group(1): Namely, the '9' which follows the 'A'
      tmp = matcher.group(1);
      System.out.println(tmp); //tmp is 9
    }
  }
}

Listing 3-2 allows you to extract parts of the candidate String that match the entire expression. It also allows you to extract subsections of that matching section. Thus, you can extract 9 from the matching region A9 because 9 matches group(1). That is, the regex engine stores the 9 when it matches A9, because you defined the subgroup (\d).
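
The same capture happens afresh on every match, not just the first. As a rough sketch (the class RepeatedSubGroupSketch, the helper capturedDigits, and the longer candidate string are all assumptions made for illustration, and the diamond operator assumes Java 7 or later), you can call find() in a loop and read group(1) each time:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedSubGroupSketch {
    // Hypothetical helper: collects whatever group(1) captured
    // for every match of \w(\d) in the candidate.
    static List<String> capturedDigits(String candidate) {
        Matcher matcher = Pattern.compile("\\w(\\d)").matcher(candidate);
        List<String> digits = new ArrayList<>();
        // Each successful find() moves to the next match and
        // replaces the stored contents of group(0) and group(1).
        while (matcher.find()) {
            digits.add(matcher.group(1));
        }
        return digits;
    }

    public static void main(String[] args) {
        // The matches are 'A9' and 'B7', so group(1) is '9' and then '7'.
        System.out.println(capturedDigits("A9 and B7 are my favorites")); // [9, 7]
    }
}
```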

Accessing Subgroups

So how does all of this actually work? Are there little fairies running around under the hood of the regex engine, keeping track of all these groups? Well, yes and no.

Although there are no fairies under the hood that I know of, the regex engine is internally tracking all subgroups by putting the matching sections from the candidate String into memory. Thus, because you defined the pattern as \w(\d), the regex engine will keep track of any single digit when that digit is preceded by an alphanumeric or underscore character. That’s what the regex engine thinks you mean for it to do when you put the expression \d in parentheses.

The engine provides access to these captured groups based on their numeric index. Captured groups are indexed from left to right, in the order of their opening parentheses, and group(0) always refers to the original expression in its entirety. Thus, in the preceding example, group(0) refers to the part of the candidate string that matches the entire expression \w(\d), whereas group(1) refers to the part of the candidate string that matches the (\d) part of the expression.

For example, if your pattern was (\w)(\d)(\w)(\w) and your candidate string was J2SE, then group 0 would have matched the entire candidate J2SE. Group 1 would have matched J, group 2 would have matched 2, group 3 would have matched S, and group 4 would have matched E.

Correspondingly, if your pattern stayed (\w)(\d)(\w)(\w) but your candidate string was R2D2, then group 0 would have matched the entire candidate R2D2. Group 1 would have matched R, group 2 would have matched 2, group 3 would have matched D, and group 4 would have matched 2.
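
The two walk-throughs above can be checked with a short sketch. The class GroupIndexSketch and its groups helper are hypothetical names, String.join assumes Java 8 or later, and the helper uses matches() rather than find() because each candidate is meant to match the pattern in full. The last call adds a nested pattern, ((\w)(\d)), which is not in the text above, to show what "in the order of their opening parentheses" means when one group sits inside another.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    // Hypothetical helper: returns group(0) through group(groupCount())
    // when the whole candidate matches the regex, or null when it does not.
    static String[] groups(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.matches()) {
            return null;
        }
        String[] result = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            result[i] = matcher.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four side-by-side subgroups, as in the text.
        System.out.println(String.join(",", groups("(\\w)(\\d)(\\w)(\\w)", "J2SE"))); // J2SE,J,2,S,E
        System.out.println(String.join(",", groups("(\\w)(\\d)(\\w)(\\w)", "R2D2"))); // R2D2,R,2,D,2
        // Nested subgroups: the outer parenthesis opens first, so it is group 1,
        // and the two inner groups become groups 2 and 3.
        System.out.println(String.join(",", groups("((\\w)(\\d))", "A9"))); // A9,A9,A,9
    }
}
```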
