Home arrow Java arrow Page 3 - Advanced Regex
JAVA

Advanced Regex


Have you reached the point in your studies of J2SE that you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of concepts, and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions Taming the Java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
Rating: 5 stars5 stars5 stars5 stars5 stars / 18
July 07, 2005
TABLE OF CONTENTS:
  1. · Advanced Regex
  2. · Noncapturing Subgroups
  3. · Greedy Qualifiers
  4. · Reluctant Qualifiers
  5. · Understanding Lookarounds
  6. · Zen and the Art of Efficient Expressions
  7. · Summary

print this article
SEARCH DEVARTICLES

Advanced Regex - Greedy Qualifiers
(Page 3 of 7 )

You might have noticed that the third subgroup of the pattern (\w)(\d\d)(\w+), namely (\w+), actually provided the regex engine with some discretion. Given the candidate X99SuperJava, the subexpression(w+) only had to match one or more word characters to meet its obligation, yet it chose to match them all. That is, it chose to match the candidate SuperJava when it could have just matched S, Su, Sup, and so on. After all, these also meet the requirement of being one or more word characters. Why did the engine decide to match as much as possible?

It did so because it is, in a word, greedy. The nature of the regex engine is to match as much as it possibly can, so long as that match doesn’t interfere with another matching subexpression somewhere else in the pattern. Thus, it matches the entire candidate SuperJava. Table 3-1 presents the Java regex greedy qualifiers for your reference.

Table 3-1. Greedy Qualifiers Regex

Description

?

The preceding is repeated once or not at all.

*

The preceding is repeated zero or more times.

+

The preceding is repeated one or more times.

{n}

The preceding is repeated exactly n times.

{n,}

The preceding is repeated at least ntimes.

{n,m}

The preceding is repeated at least n times, but no more than m

times. This includes mrepetitions.

Greedy matching has an interesting behavioral pattern that I think of as greedy-generous.The leftmost part of the pattern, the first (\w), attempts to match as much as possible. When it can’t match anymore, the next part of the pattern is evaluated. This is the greedy part.

If that second part of the pattern fails to find any matches, and then the first matching pattern starts to slowly release characters that it has already collected, thus providing the second part of the pattern with more opportunities to match. This is the generous part of the behavior.

At this point, one of two things can happen. The first alternative is that the latter group can finally achieve a match, in which case the first group stops releasing characters. The second alternative is that the latter group can fail to match, even with the characters made available to it from the earlier group. If this happens, then the group that was releasing characters, in effect, collects those released characters again, and the regex engine goes on. If a later subexpression again fails to match, then the process is repeated.

So, what does all of this mean to you? Well, quite a bit, really. Consider the regex pattern (\w+)(\d\d)(\w+), which is almost identical to the pattern (\w)(\d\d)(\w+) except that first pattern uses + after the first \w, thus forming the group (\w+). That little + can make a huge difference in terms of efficiency, and one that you might not notice in casual use, because the result of applying the new pattern against the candidate X99SuperJava doesn’t change.

Listing 3-5 examines what actually happens when the pattern is evaluated.

Listing 3-5. Greedy Qualifier Example

import java.util.regex.*;
public class GreedyExample{
  public static void main(String args[]){
   
//define the pattern
   
String regex = "(\\w+)(\\d\\d)(\\w+)";
   
//compile the pattern
    Pattern pattern = Pattern.compile(regex);
   
//define the candidate string
   
String candidate = "X99SuperJava";
   
//extract a matcher for the candidate string
    Matcher matcher = pattern.matcher(candidate);
   
matcher.find();
   
//extract the matching groups
   
System.out.println(matcher.group(1));//returns 'X'
   
System.out.println(matcher.group(2));//returns '99'
   
System.out.println(matcher.group(3));//returns SuperJava
 
}
}

When group(1) runs, (\w+) examines every character in the candidate String X99SuperJava. That is, X is explicitly considered, passes inspection, and is put into the matching bag for this group. Because this pattern is greedy and has + after \w, it continues. Next, 9 is explicitly considered and passes inspection. Then, the next 9 is considered. This continues until the entire String X99SuperJava is consumed.

After group(1) is satisfied, group(2), namely (\d\d), gets an opportunity. Because group(2) is unable to match anything at all, group(1) releases the a character at the end of X99SuperJava. The a character is considered by group(2), found not be a digit, and considered not to be sufficient. Thus, group(1) releases the v character. group(2) inspects it, finds it lacking, and rejects it. Thus, group(1) releases the a character immediately before the v character in X99SuperJava. This continues until group(1) has released every character except X. Finally, the release of the two 9 characters allow group(2) to match. Now group(1) has X and group(2) has 99.

Finally, group(3) gets an opportunity to run. It starts examining the candidate String X99SuperJava at the point immediately following the second 9 character. And because it’s greedy, it matches the entire String SuperJava.

Thus, the two patterns (\w)(\d\d)(\w+) and (\w+)(\d\d)(\w+)produce exactly the same result when applied to the String X99SuperJava but at vastly different efficiency costs. Although this may be insignificant when you’re dealing with a small String, it’s very significant when you’re parsing a directory full of files; it could mean the difference between your application working and it running out of memory.

Possessive Qualifiers

Given the context of greedy qualifiers, possessive qualifiers are relatively easy to understand. They’re unique to Java, though they’ll probably be adopted by other regex languages in time. Simply put, they’re greedy, but never generous. Adding a plus sign (+) after an existing greedy qualifier forms a possessive qualifier. Thus, \d+ becomes \d++, \d{n,m} becomes \d{n,m}+, and so on.

Consider the previous example in which the pattern (\w+)(\d\d)(\w+) is applied to the candidate string X99SuperJava. I established earlier that the regex engine, when faced with the group (\w+), would attempt to match as many characters as possible. I also established that it would release those matches if such a release would help a later group achieve a match.

If you don’t want the group to release its matches to help later groups match, however, then you would use possessive qualifiers simply by following your last expression with an additional plus sign. You form the pattern (\w++)(\d\d)(\w+)—that is, you turn \w+ into \w++.

When the pattern is first run, group(1), namely (\w++), consumes every character in the candidate X99SuperJava. However, because of the existence of the second plus sign, the engine refuses to release any of those matching characters to help the following groups match. You find that group(1) matches, and group(2) and group(3) don’t—the regex as whole doesn’t match. Listing 3-6 demonstrates this.

Listing 3-6. Possessive Qualifier Example

import java.util.regex.*;
public class PossesiveExample{
 
public static void main(String args[]){
    //define the pattern
    String regex = "(\\w++)(\\d\\d)(\\w+)";
    //compile the pattern
    Pattern pattern = Pattern.compile(regex);
   
//define the candidate string
    String candidate = "X99SuperJava";
   
//extract a matcher for the candidate string
    Matcher matcher = pattern.matcher(candidate);
   
if (matcher.find()){
      System.out.println("GROUP 0:" +
      matcher.group(0));
      System.out.println("GROUP 1:" +
      matcher.group(1));
      System.out.println("GROUP 2:" + matcher.group(2)); 
      System.out.println("GROUP 3:" + matcher.group(3));
   
}
    else{
      System.out.println("NO MATCHES" );
    }
   
System.out.println("Done");
  }
}


blog comments powered by Disqus
JAVA ARTICLES

- Java Too Insecure, Says Microsoft Researcher
- Google Beats Oracle in Java Ruling
- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials