The Mechanics of Expression Processing - Rule 2: The Standard Quantifiers Are Greedy
(Page 4 of 4 )
So far, we have seen features that are quite straightforward. They are also rather boring—you can’t do much without involving more-powerful metacharacters such as star, plus, alternation, and so on. Their added power requires more infor mation to understand them fully.
First, you need to know that the standard quantifiers (?, *, +, and {min, max} ) are greedy. When one of these governs a subexpression, such as a in a?, the (expr) in (expr) *, or [0-9] in [0-9]+, there is a minimum number of matches that are required before it can be considered successful, and a maximum number that it will ever attempt to match. This has been mentioned in earlier chapters—what’s new here concerns the rule that they always attempt to match as much as possible. (Some flavors provide other types of quantifiers, but this section is concerned only with the standard, greedy ones.)
To be clear, the standard quantifiers settle for something less than the maximum number of allowed matches if they have to, but they always attempt to match as many times as they can, up to that maximum allowed. The only time they settle for anything less than their maximum allowed is when matching too much ends up causing some later part of the regex to fail. A simple example is using \b\w+s\b to match words ending with an 's', such as ‘regexes'. The \w+ alone is happy to match the entire word, but if it does, it leaves nothing for the s to match. To achieve the overall match, the \w+ must settle for matching only regexes', thereby allowing s\b (and thus the full regex) to match.
If it turns out that the only way the rest of the regex can succeed is when the greedy construct in question matches nothing, well, that’s perfectly fine, if zero matches are allowed (as with star, question, and {0, max} intervals). However, it turns out this way only if the requirements of some later subexpression force the issue. It’s because the greedy quantifiers always (or, at least, try to) take more than they minimally need that they are called greedy.
Greediness has many useful (but sometimes troublesome) implications. It explains, for example, why [0-9] + matches the full number in March 1998. Once the '1' has been matched, the plus has fulfilled its minimum requirement, but it’s greedy, so it doesn’t stop. So, it continues, and matches the ‘998' before being forced to stop by the end of the string. (Since [0-9] can’t match the nothingness at the end of the string, the plus finally stops.)
A subjective example
Of course, this method of grabbing things is useful for more than just numbers. Let’s say you have a line from an email header and want to check whether it is the subject line. As we saw in earlier chapters ( 55), you simply use ˆSubject:
However, if you use ^Subject: (.*), you can later access the text of the subject itself via the tool’s after-the-fact parenthesis memory (for example, $1 in Perl).†
Before looking at why .* matches the entire subject, be sure to understand that once the ˆSubject: part matches, you’re guaranteed that the entire regular expression will eventually match. You know this because there’s nothing after ˆSubject: that could cause the expression to fail; .* can never fail, since the worst case of “no matches” is still considered successful for star.
So, why do we even bother adding .*? Well, we know that because star is greedy, it attempts to match dot as many times as possible, so we use it to “fill” $1. In fact, the parentheses add nothing to the logic of what the regular expression matches—in this case we use them simply to capture the text matched by .*.
Once .* hits the end of the string, the dot isn’t able to match, so the star finanally stops and lets the next item in the regular expression attempt to match (for even though the starred dot could match no further, perhaps a subexpression later in the regex could). Ah, but since it turns out that there is no next item, we reach the end of the regex and we know that we have a successful match.
Being too greedy
Let’s get back to the concept of a greedy quantifier being as greedy as it can be. Consider how the matching and results would change if we add another .*: ^Subject: (.*) .*. The answer is: nothing would change. The initial .* (inside the parentheses) is so greedy that it matches all the subject text, never leaving anything for the second .* to match. Again, the failure of the second .* to match something is not a problem, since the star does not require a match to be successful. Were the second .* in parentheses as well, the resulting $2 would always be empty.
Does this mean that after .*, a regular expression can never have anything that is expected to actually match? No, of course not. As we saw with the \w+s example, it is possible for something later in the regex to force something previously greedy to give back (that is, relinquish or conceptually “unmatch”) if that’s what is necessary to achieve an overall match.
Let’s consider the possibly useful ˆ.* ([0-9][0-9]), which finds the last two digits on a line, wherever they might be, and saves them to $1. Here’s how it works: at first, .* matches the entire line. Because the following ([0-9] [0-9]) is required, its initial failure to match at the end of the line, in effect, tells .* “Hey, you took too much! Give me back something so that I can have a chance to match.” Greedy components first try to take as much as they can, but they always defer to the greater need to achieve an overall match. They’re just stubborn about it, and only do so when forced. Of course, they’ll never give up something that hadn’t been optional in the first place, such as a plus quantifier’s first match.
With this in mind, let’s apply ˆ.* ([0-9] [0-9]) to about 24 characters long'. Once .* matches the whole string, the requirement for the first [0-9] to match forces .* to give up ‘g’ (the last thing it had matched). That doesn’t, however, allow [0-9] to match, so .* is again forced to relinquish something, this time the ‘n’. This cycle continues 15 more times until .* finally gets around to giving up ‘4’.
Unfortunately, even though the first [0-9] can then match that ‘4’, the second still cannot. So, .* is forced to relinquish once more in an attempt fo find an overall match. This time .* gives up the ‘2’, which the first
[0-9] can then match. Now, the ‘4’ is free for the second [0-9] to match, and so the entire expression matches about 24 char...', with $1 getting '24'.
First come, first served
Consider now using ^.* ([0-9]+) </B>, ostensibly to match not just the last two digits, but the last whole number, however long it might be. When this regex is applied to ‘Copyright 2003.’, what is captured? Turn the page to check your answer. Getting down to the details
I should clear up a few things here. Phrases like “ the .*, gives up...” and “ the [0-9] forces...” are slightly misleading. I used these terms because they’re easy to grasp, and the end result appears to be the same as reality. However, what really happens behind the scenes depends on the basic engine type, DFA or NFA. So, it’s time to see what these really are.
Please check back next week for the continuation of this article.
| DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware. |
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|