Greediness, Backtracking, and Expression Processing - Using Lazy Quantifiers
(Page 2 of 4 )
These problems arise because the standard quantifiers are greedy. Some NFAs support lazy quantifiers ( 141), with *? being the lazy version of *. With that in mind, let’s apply <B> .*?</B> to:
...<B>Billions</B> and <B>Zillions</B> of suns...
After the initial <B> has matched, .*? immediately decides that since it doesn’t require any matches, it lazily doesn’t bother trying to perform any. So, it immediately passes control to the following <:

The < doesn’t match at that point, so control returns back to .*? where it still has its untried option to attempt a match (to attempt multiple matches, actually). It begrudgingly does so, with the dot matching the underlined B in ˙˙˙<B>Billions˙˙˙. Again, the *? has the option to match more, or to stop. It’s lazy, so it first tries stopping. The subsequent < still fails, so .*? has to again exercise its untried match option. After eight cycles, .*? eventually matches Billions, at which point the subsequent < (and the whole </B> subexpression) is finally able to match: ˙˙˙<B>Billions</B> and <B>Zillions</B> of suns˙˙˙
So, as we’ve seen, the greediness of star and friends can be a real boon at times, while troublesome at others. Having non-greedy, lazy versions is wonderful, as they allow you to do things that are otherwise very difficult (or even impossible). Still, I’ve often seen inexperienced programmers use lazy quantifiers in inappropriate situations. In fact, what we’ve just done may not be appropriate. Consider applying <B>.*? </B> to:
˙˙˙<B>Billions and <B>Zillions</B> of suns˙˙˙
It matches as shown, and while I suppose it depends on the exact needs of the situation, I would think that in this case that match is not desired. However,
there’s nothing about .*? to stop it from marching right past the Zillion’s <B> to its </B>.
This is an excellent example of why a lazy quantifier is often not a good replacement for a negated class. In the ".*" example, using [ˆ"] as a replacement for the dot specifically disallows it from marching past a delimiter—a quality we wish our current regex had.
However, if negative lookahead ( 133) is supported, you can use it to create something comparable to a negated class. Alone, (?!<B>) is a test that is successful if <B> is not at the current location in the string. Those are the locations that we want the dot of <B>.*?</B> to match, so changing that dot to
((?!<B>) .) creates a regex that matches where we want it, but doesn’t match where we don’t. Assembled all together, the whole thing can become quite confusing, so I’ll show it here in a free-spacing mode ( 111) with comments:
<B> | # Match the opening <B> |
( | # Now, only as many of the following as needed . . . |
(?! <B> | ) | # | If not <B> . . . |
. | | # | . . . any character is okay |
)*? | | # | |
</B> | # . . . until the closing delimiter can match |
With one adjustment to the lookahead, we can put the quantifier back to a normal greedy one, which may be less confusing to some:
<B> | # Match the opening <B> |
( | # Now, as many of the following as possible . . . |
(?! </?B> | ) | # | If not <B>, and not </B> . . . |
. | | # | . . . any character is okay |
)+ | # (now greedy) |
</B> | # <ANNO> . . . until the closing delimiter can match. |
Now, the lookahead prohibits the main body to match beyond </B> as well as <B>, which eliminates the problem we tried to solve with laziness, so the laziness can be removed. This expression can still be improved; we’ll see it again during the discussion on efficiency in Chapter 6 ( 270).
Greediness and Laziness Always Favor a Match
Recall the price display example from Chapter 2 ( 51). We’ll examine this example in detail at a number of points during this chapter, so I’ll recap the basic issue: due to floating-point representation problems, values that should have been “1.625” or “3.00” were sometimes coming out like “1.62500000002828” and “3.00000000028822”. To fix this, I used
$price =˜ s/(\.\d\d[1-9]?)\d*/$1/;
to lop off all but the first two or three decimal digits from the value stored in the variable $price. The \.\d\d matches the first two decimal digits regardless, while the [1-9]? matches the third digit only if it is non-zero.
I then noted:
Anything matched so far is what we want to keep, so we wrap it in parentheses to capture to $1. We can then use $1 in the replacement string. If this is the only thing that matches, we replace exactly what was matched with itself—not very useful. However, we go on to match other items outside the $1 parentheses. They don’t find their way to the replacement string, so the effect is that they’re removed. In this case, the “to be removed” text is any extra digits, the \d* at the end of the regex.
So far so good, but let’s consider what happens when the contents of the variable $price is already well formed. When it is 27.625, the (\.\d\d[1-9]?) part matches the entire decimal part. Since the trailing \d* doesn’t match anything, the substitution replaces the '.625' with '.625' — an effective no-op.
This is the desired result, but wouldn’t it be just a bit more efficient to do the replacement only when it would have some real effect (that is, do the replacement only when \d* actually matches something) ? Well, we know how to write “at least one digit”! Simply replace \d* with \d*:
$price =˜ s/(\.\d\d[1-9]?)\d+/$1/
With crazy numbers like “1.62500000002828”, it still works as before, but with something such as “9.43”, the trailing \d+ isn’t able to match, so rightly, no substitution occurs. So, this is a great modication, yes? No! What happens with a three-digit decimal value like 27.625?. We want this value to be left alone, but that’s not what happens. Stop for a moment to work through the match of 27.625 yourself, with particular attention to how the '5' interacts with the regex.
In hindsight, the problem is really fairly simple. Picking up in the action once (\.\d\d[1-9]?)\d+ has matched 27.625, we find that \d+ can’t match. That’s no problem for the overall match, though, since as far as the regex is concerned, the match of ‘5’ by [1-9] was optional and there is still a saved state to try. This state allows [1-9]? to match nothing, leaving the 5 to fulfill the must-match-one requirement of \d+. Thus, we get the match, but not the right match: .625 is replaced by.62, </B>, and the value becomes incorrect.
What if [1-9]? were lazy instead? We’d get the same match, but without the intervening “match the 5 but then give it back” steps, since the lazy [1-9]?? first skips the match attempt. So, laziness is not a solution to this problem.
Next: The Essence of Greediness, Laziness, and Backtracking >>
More Java Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|