Home arrow Java arrow Page 3 - Greediness, Backtracking, and Expression Processing
JAVA

Greediness, Backtracking, and Expression Processing


This article, the third of a four-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 4 stars4 stars4 stars4 stars4 stars / 3
January 11, 2007
TABLE OF CONTENTS:
  1. · Greediness, Backtracking, and Expression Processing
  2. · Using Lazy Quantifiers
  3. · The Essence of Greediness, Laziness, and Backtracking
  4. · The essence of atomic grouping

print this article
SEARCH DEVARTICLES

TOOLS YOU CAN USE

advertisement
Greediness, Backtracking, and Expression Processing - The Essence of Greediness, Laziness, and Backtracking
(Page 3 of 4 )

The lesson of the preceding section is that it makes no difference whether there are greedy or lazy components to a regex; an overall match takes precedence over an overall non-match. This includes taking from what had been greedy (or giving to what had been lazy) if that’s what is required to achieve a match, because when a “local failure” is hit, the engine keeps going back to the saved states (retracing steps to the piles of bread crumbs), trying the untested paths. Whether greedily or lazily, every possible path is tested before the engine admits failure.

The order that the paths are tested is different between greedy and lazy quantifiers (after all, that’s the whole point of having the two!), but in the end, if no match is to be found, it’s known only after testing every possible path.

If, on the other hand, there exists just one plausible match, both a regex with a greedy quantifier and one with a lazy quantifier find that match, although the series of paths they take to get there may be wildly different. In these cases, selecting greedy or lazy doesn’t influence what is matched, but merely how long or short a path the engine takes to get there (which is an efficiency issue, the subject of Chapter 6).

Finally, if there is more than one plausible match, understanding greediness, laziness, and backtracking allows you to know which is selected. The ".*" example has three plausible matches:

  The name "McDonald’s" is said "makudonarudo" in Japanese

We know that ".*", with the greedy star, selects the longest one, and that ".*?", with the lazy star, selects the shortest.

Possessive Quantifiers and Atomic Grouping

The '.625' example on the facing page shows important insights about NFA matching as we know it, and how with that particular example our naïve intents were thwarted. Some flavors do provide tools to help us here, but before looking at them, it’s absolutely essential to fully understand the preceding section, “The Essence of Greediness, Laziness, and Backtracking.” Be sure to review it if you have any doubts.

So, continuing with the '.625' example and recalling what we really want to happen, we know that if the matching can successfully get to the marked position in (\.\d\d[1-9]?)\d+, we never want it to go back. That is, we want [1-9] to match if possible, but if it does, we don’t want that match to be given up. Saying it more forcefully, we would rather have the entire match attempt fail, if need be, before giving up something matched by the [1-9]. (As you’ll recall, the problem before when this regex was applied to '.625' was that it indeed didn’t fail, but instead went back to try the remaining skip-me alternative.)

Well, what if we could somehow eliminate that skip-me alternative (eliminate the state that ? saves before it makes the attempt to match [1-9])? If there was no state to go back to, a match of [1-9] wouldn’t be given up. That’s what we want! Ah, but if there was no skip-me state to go back to, what would happen if we applied the regex to '.5000'? The [1-9] couldn’t match, and in this case, we do want it to go back and skip the [1-9] so that the subsequent \d+ can match digits to be removed.

It sounds like we have two conflicting desires, but thinking about it, what we really want is to eliminate the skip-me alternative only if the match-me alternative succeeds. That is, if [1-9] is indeed able to match, we’d like to get rid of the skip-me saved state so that it is never given up. This is possible, with regex flavors that support (?>...) atomic grouping ( 139), or possessive quantifiers like [1-9]?+  ( 142). We’ll look at atomic grouping first.

Atomic grouping with (?>...)

In essence, matching within (?>...) carries on normally, but if and when matching is able to exit the construct (that is, get past its closing parenthesis), all states that had been saved while within it are thrown away. In practice, this means that once the atomic grouping has been exited, whatever text was matched within it is now one unchangeable unit, to be kept or given back only as a whole. All saved states
representing untried options within the parentheses are eliminated, so backtracking can never undo any of the decisions made within (at least not once they’re “locked in” when the construct is exited).

So, let’s consider (\.\d\d(?> [1-9]?))\d+ .  Quantifiers work normally within atomic grouping, so if [1-9] is not able to match, the regex returns to the skip-me saved state the ? had left. That allows matching to leave the atomic grouping and continue on to the \d+. In this case, there are no saved states to flush when control leaves the atomic grouping (that is, there are no saved states remaining that had been created within it).

However, when [1-9] is able to match, matching can exit the atomic grouping, but this time, the skip-me state is still there. Since it had been created within the atomic grouping we’re now exiting, it is thrown away. This would happen when matching against both '.625', and, say, '.625000'.  In the latter case, having eliminated the state turns out not to matter, since the \d+ has the '.625000' to match. after which that regex is done. With '.625' alone, the inability of \d+  to match has the regex engine wanting to backtrack, but it can’t since that skip-me alternative was thrown away. The lack of any state to backtrack to results in the overall match
attempt failing, and '.625' is left undisturbed as we wish.


blog comments powered by Disqus
JAVA ARTICLES

- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...
- Generics and Limitations in Java
- Getting Started with Java Web Development in...

Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 



© 2003-2012 by Developer Shed. All rights reserved. DS Cluster 8 - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials