Advanced Regex - Greedy Qualifiers
(Page 3 of 7 )
You might have noticed that the third subgroup of the pattern (\w)(\d\d)(\w+), namely (\w+), actually provided the regex engine with some discretion. Given the candidate X99SuperJava, the subexpression(w+) only had to match one or more word characters to meet its obligation, yet it chose to match them all. That is, it chose to match the candidate SuperJava when it could have just matched S, Su, Sup, and so on. After all, these also meet the requirement of being one or more word characters. Why did the engine decide to match as much as possible?
It did so because it is, in a word, greedy. The nature of the regex engine is to match as much as it possibly can, so long as that match doesn’t interfere with another matching subexpression somewhere else in the pattern. Thus, it matches the entire candidate SuperJava. Table 3-1 presents the Java regex greedy qualifiers for your reference.
Table 3-1. Greedy Qualifiers Regex | Description |
? | The preceding is repeated once or not at all. |
* | The preceding is repeated zero or more times. |
+ | The preceding is repeated one or more times. |
{n} | The preceding is repeated exactly n times. |
{n,} | The preceding is repeated at least ntimes. |
{n,m} | The preceding is repeated at least n times, but no more than m times. This includes mrepetitions. |
Greedy matching has an interesting behavioral pattern that I think of as greedy-generous.The leftmost part of the pattern, the first (\w), attempts to match as much as possible. When it can’t match anymore, the next part of the pattern is evaluated. This is the greedy part.
If that second part of the pattern fails to find any matches, and then the first matching pattern starts to slowly release characters that it has already collected, thus providing the second part of the pattern with more opportunities to match. This is the generous part of the behavior.
At this point, one of two things can happen. The first alternative is that the latter group can finally achieve a match, in which case the first group stops releasing characters. The second alternative is that the latter group can fail to match, even with the characters made available to it from the earlier group. If this happens, then the group that was releasing characters, in effect, collects those released characters again, and the regex engine goes on. If a later subexpression again fails to match, then the process is repeated.
So, what does all of this mean to you? Well, quite a bit, really. Consider the regex pattern (\w+)(\d\d)(\w+), which is almost identical to the pattern (\w)(\d\d)(\w+) except that first pattern uses + after the first \w, thus forming the group (\w+). That little + can make a huge difference in terms of efficiency, and one that you might not notice in casual use, because the result of applying the new pattern against the candidate X99SuperJava doesn’t change.
Listing 3-5 examines what actually happens when the pattern is evaluated.
Listing 3-5. Greedy Qualifier Example
import java.util.regex.*;
public class GreedyExample{
public static void main(String args[]){
//define the pattern
String regex = "(\\w+)(\\d\\d)(\\w+)";
//compile the pattern
Pattern pattern = Pattern.compile(regex);
//define the candidate string
String candidate = "X99SuperJava";
//extract a matcher for the candidate string
Matcher matcher = pattern.matcher(candidate);
matcher.find();
//extract the matching groups
System.out.println(matcher.group(1));//returns 'X'
System.out.println(matcher.group(2));//returns '99'
System.out.println(matcher.group(3));//returns SuperJava
}
}
When group(1) runs, (\w+) examines every character in the candidate String X99SuperJava. That is, X is explicitly considered, passes inspection, and is put into the matching bag for this group. Because this pattern is greedy and has + after \w, it continues. Next, 9 is explicitly considered and passes inspection. Then, the next 9 is considered. This continues until the entire String X99SuperJava is consumed.
After group(1) is satisfied, group(2), namely (\d\d), gets an opportunity. Because group(2) is unable to match anything at all, group(1) releases the a character at the end of X99SuperJava. The a character is considered by group(2), found not be a digit, and considered not to be sufficient. Thus, group(1) releases the v character. group(2) inspects it, finds it lacking, and rejects it. Thus, group(1) releases the a character immediately before the v character in X99SuperJava. This continues until group(1) has released every character except X. Finally, the release of the two 9 characters allow group(2) to match. Now group(1) has X and group(2) has 99.
Finally, group(3) gets an opportunity to run. It starts examining the candidate String X99SuperJava at the point immediately following the second 9 character. And because it’s greedy, it matches the entire String SuperJava.
Thus, the two patterns (\w)(\d\d)(\w+) and (\w+)(\d\d)(\w+)produce exactly the same result when applied to the String X99SuperJava but at vastly different efficiency costs. Although this may be insignificant when you’re dealing with a small String, it’s very significant when you’re parsing a directory full of files; it could mean the difference between your application working and it running out of memory.
Possessive Qualifiers Given the context of greedy qualifiers, possessive qualifiers are relatively easy to understand. They’re unique to Java, though they’ll probably be adopted by other regex languages in time. Simply put, they’re greedy, but never generous. Adding a plus sign (+) after an existing greedy qualifier forms a possessive qualifier. Thus, \d+ becomes \d++, \d{n,m} becomes \d{n,m}+, and so on.
Consider the previous example in which the pattern (\w+)(\d\d)(\w+) is applied to the candidate string X99SuperJava. I established earlier that the regex engine, when faced with the group (\w+), would attempt to match as many characters as possible. I also established that it would release those matches if such a release would help a later group achieve a match.
If you don’t want the group to release its matches to help later groups match, however, then you would use possessive qualifiers simply by following your last expression with an additional plus sign. You form the pattern (\w++)(\d\d)(\w+)—that is, you turn \w+ into \w++.
When the pattern is first run, group(1), namely (\w++), consumes every character in the candidate X99SuperJava. However, because of the existence of the second plus sign, the engine refuses to release any of those matching characters to help the following groups match. You find that group(1) matches, and group(2) and group(3) don’t—the regex as whole doesn’t match. Listing 3-6 demonstrates this.
Listing 3-6. Possessive Qualifier Example
import java.util.regex.*;
public class PossesiveExample{
public static void main(String args[]){
//define the pattern
String regex = "(\\w++)(\\d\\d)(\\w+)";
//compile the pattern
Pattern pattern = Pattern.compile(regex);
//define the candidate string
String candidate = "X99SuperJava";
//extract a matcher for the candidate string
Matcher matcher = pattern.matcher(candidate);
if (matcher.find()){
System.out.println("GROUP 0:" +
matcher.group(0));
System.out.println("GROUP 1:" +
matcher.group(1));
System.out.println("GROUP 2:" + matcher.group(2));
System.out.println("GROUP 3:" + matcher.group(3));
}
else{
System.out.println("NO MATCHES" );
}
System.out.println("Done");
}
}
Next: Reluctant Qualifiers >>
More Java Articles
More By Apress Publishing
|
This article is excerpted from chapter three of Java Regular Expressions Taming the Java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070). Check it out at your favorite bookstore. Buy this book now.
|
|