Greediness, Backtracking, and Expression Processing - The Essence of Greediness, Laziness, and Backtracking
(Page 3 of 4 )
The lesson of the preceding section is that it makes no difference whether there are greedy or lazy components to a regex; an overall match takes precedence over an overall non-match. This includes taking from what had been greedy (or giving to what had been lazy) if that’s what is required to achieve a match, because when a “local failure” is hit, the engine keeps going back to the saved states (retracing steps to the piles of bread crumbs), trying the untested paths. Whether greedily or lazily, every possible path is tested before the engine admits failure.
The order that the paths are tested is different between greedy and lazy quantifiers (after all, that’s the whole point of having the two!), but in the end, if no match is to be found, it’s known only after testing every possible path.
If, on the other hand, there exists just one plausible match, both a regex with a greedy quantifier and one with a lazy quantifier find that match, although the series of paths they take to get there may be wildly different. In these cases, selecting greedy or lazy doesn’t influence what is matched, but merely how long or short a path the engine takes to get there (which is an efficiency issue, the subject of Chapter 6).
Finally, if there is more than one plausible match, understanding greediness, laziness, and backtracking allows you to know which is selected. The ".*" example has three plausible matches:
The name "McDonald’s" is said "makudonarudo" in Japanese
We know that ".*", with the greedy star, selects the longest one, and that ".*?", with the lazy star, selects the shortest.
Possessive Quantifiers and Atomic Grouping
The '.625' example on the facing page shows important insights about NFA matching as we know it, and how with that particular example our naïve intents were thwarted. Some flavors do provide tools to help us here, but before looking at them, it’s absolutely essential to fully understand the preceding section, “The Essence of Greediness, Laziness, and Backtracking.” Be sure to review it if you have any doubts.
So, continuing with the '.625' example and recalling what we really want to happen, we know that if the matching can successfully get to the marked position in (\.\d\d[1-9]?)\d+, we never want it to go back. That is, we want [1-9] to match if possible, but if it does, we don’t want that match to be given up. Saying it more forcefully, we would rather have the entire match attempt fail, if need be, before giving up something matched by the [1-9]. (As you’ll recall, the problem before when this regex was applied to '.625' was that it indeed didn’t fail, but instead went back to try the remaining skip-me alternative.)
Well, what if we could somehow eliminate that skip-me alternative (eliminate the state that ? saves before it makes the attempt to match [1-9])? If there was no state to go back to, a match of [1-9] wouldn’t be given up. That’s what we want! Ah, but if there was no skip-me state to go back to, what would happen if we applied the regex to '.5000'? The [1-9] couldn’t match, and in this case, we do want it to go back and skip the [1-9] so that the subsequent \d+ can match digits to be removed.
It sounds like we have two conflicting desires, but thinking about it, what we really want is to eliminate the skip-me alternative only if the match-me alternative succeeds. That is, if [1-9] is indeed able to match, we’d like to get rid of the skip-me saved state so that it is never given up. This is possible, with regex flavors that support (?>...) atomic grouping ( 139), or possessive quantifiers like [1-9]?+ ( 142). We’ll look at atomic grouping first.
Atomic grouping with (?>...)
In essence, matching within (?>...) carries on normally, but if and when matching is able to exit the construct (that is, get past its closing parenthesis), all states that had been saved while within it are thrown away. In practice, this means that once the atomic grouping has been exited, whatever text was matched within it is now one unchangeable unit, to be kept or given back only as a whole. All saved states
representing untried options within the parentheses are eliminated, so backtracking can never undo any of the decisions made within (at least not once they’re “locked in” when the construct is exited).
So, let’s consider (\.\d\d(?> [1-9]?))\d+ . Quantifiers work normally within atomic grouping, so if [1-9] is not able to match, the regex returns to the skip-me saved state the ? had left. That allows matching to leave the atomic grouping and continue on to the \d+. In this case, there are no saved states to flush when control leaves the atomic grouping (that is, there are no saved states remaining that had been created within it).
However, when [1-9] is able to match, matching can exit the atomic grouping, but this time, the skip-me state is still there. Since it had been created within the atomic grouping we’re now exiting, it is thrown away. This would happen when matching against both '.625', and, say, '.625000'. In the latter case, having eliminated the state turns out not to matter, since the \d+ has the '.625000' to match. after which that regex is done. With '.625' alone, the inability of \d+ to match has the regex engine wanting to backtrack, but it can’t since that skip-me alternative was thrown away. The lack of any state to backtrack to results in the overall match
attempt failing, and '.625' is left undisturbed as we wish.
Next: The essence of atomic grouping >>
More Java Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|