Java
  Home arrow Java arrow Page 3 - Greediness, Backtracking, and Expression P...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Greediness, Backtracking, and Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 3 stars3 stars3 stars3 stars3 stars / 2
    2007-01-11

    Table of Contents:
  • Greediness, Backtracking, and Expression Processing
  • Using Lazy Quantifiers
  • The Essence of Greediness, Laziness, and Backtracking
  • The essence of atomic grouping

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Greediness, Backtracking, and Expression Processing - The Essence of Greediness, Laziness, and Backtracking


    (Page 3 of 4 )

    The lesson of the preceding section is that it makes no difference whether there are greedy or lazy components to a regex; an overall match takes precedence over an overall non-match. This includes taking from what had been greedy (or giving to what had been lazy) if that’s what is required to achieve a match, because when a “local failure” is hit, the engine keeps going back to the saved states (retracing steps to the piles of bread crumbs), trying the untested paths. Whether greedily or lazily, every possible path is tested before the engine admits failure.

    The order that the paths are tested is different between greedy and lazy quantifiers (after all, that’s the whole point of having the two!), but in the end, if no match is to be found, it’s known only after testing every possible path.

    If, on the other hand, there exists just one plausible match, both a regex with a greedy quantifier and one with a lazy quantifier find that match, although the series of paths they take to get there may be wildly different. In these cases, selecting greedy or lazy doesn’t influence what is matched, but merely how long or short a path the engine takes to get there (which is an efficiency issue, the subject of Chapter 6).

    Finally, if there is more than one plausible match, understanding greediness, laziness, and backtracking allows you to know which is selected. The ".*" example has three plausible matches:

      The name "McDonald’s" is said "makudonarudo" in Japanese

    We know that ".*", with the greedy star, selects the longest one, and that ".*?", with the lazy star, selects the shortest.

    Possessive Quantifiers and Atomic Grouping

    The '.625' example on the facing page shows important insights about NFA matching as we know it, and how with that particular example our naïve intents were thwarted. Some flavors do provide tools to help us here, but before looking at them, it’s absolutely essential to fully understand the preceding section, “The Essence of Greediness, Laziness, and Backtracking.” Be sure to review it if you have any doubts.

    So, continuing with the '.625' example and recalling what we really want to happen, we know that if the matching can successfully get to the marked position in (\.\d\d[1-9]?)\d+, we never want it to go back. That is, we want [1-9] to match if possible, but if it does, we don’t want that match to be given up. Saying it more forcefully, we would rather have the entire match attempt fail, if need be, before giving up something matched by the [1-9]. (As you’ll recall, the problem before when this regex was applied to '.625' was that it indeed didn’t fail, but instead went back to try the remaining skip-me alternative.)

    Well, what if we could somehow eliminate that skip-me alternative (eliminate the state that ? saves before it makes the attempt to match [1-9])? If there was no state to go back to, a match of [1-9] wouldn’t be given up. That’s what we want! Ah, but if there was no skip-me state to go back to, what would happen if we applied the regex to '.5000'? The [1-9] couldn’t match, and in this case, we do want it to go back and skip the [1-9] so that the subsequent \d+ can match digits to be removed.

    It sounds like we have two conflicting desires, but thinking about it, what we really want is to eliminate the skip-me alternative only if the match-me alternative succeeds. That is, if [1-9] is indeed able to match, we’d like to get rid of the skip-me saved state so that it is never given up. This is possible, with regex flavors that support (?>...) atomic grouping ( 139), or possessive quantifiers like [1-9]?+  ( 142). We’ll look at atomic grouping first.

    Atomic grouping with (?>...)

    In essence, matching within (?>...) carries on normally, but if and when matching is able to exit the construct (that is, get past its closing parenthesis), all states that had been saved while within it are thrown away. In practice, this means that once the atomic grouping has been exited, whatever text was matched within it is now one unchangeable unit, to be kept or given back only as a whole. All saved states
    representing untried options within the parentheses are eliminated, so backtracking can never undo any of the decisions made within (at least not once they’re “locked in” when the construct is exited).

    So, let’s consider (\.\d\d(?> [1-9]?))\d+ .  Quantifiers work normally within atomic grouping, so if [1-9] is not able to match, the regex returns to the skip-me saved state the ? had left. That allows matching to leave the atomic grouping and continue on to the \d+. In this case, there are no saved states to flush when control leaves the atomic grouping (that is, there are no saved states remaining that had been created within it).

    However, when [1-9] is able to match, matching can exit the atomic grouping, but this time, the skip-me state is still there. Since it had been created within the atomic grouping we’re now exiting, it is thrown away. This would happen when matching against both '.625', and, say, '.625000'.  In the latter case, having eliminated the state turns out not to matter, since the \d+ has the '.625000' to match. after which that regex is done. With '.625' alone, the inability of \d+  to match has the regex engine wanting to backtrack, but it can’t since that skip-me alternative was thrown away. The lack of any state to backtrack to results in the overall match
    attempt failing, and '.625' is left undisturbed as we wish.

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 3 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek