Java
  Home arrow Java arrow Page 2 - Greediness, Backtracking, and Expression P...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Greediness, Backtracking, and Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 3 stars3 stars3 stars3 stars3 stars / 2
    2007-01-11

    Table of Contents:
  • Greediness, Backtracking, and Expression Processing
  • Using Lazy Quantifiers
  • The Essence of Greediness, Laziness, and Backtracking
  • The essence of atomic grouping

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Greediness, Backtracking, and Expression Processing - Using Lazy Quantifiers


    (Page 2 of 4 )

    These problems arise because the standard quantifiers are greedy. Some NFAs support lazy quantifiers ( 141), with *? being the lazy version of *.  With that in mind, let’s apply <B> .*?</B> to:

       ...<B>Billions</B> and <B>Zillions</B> of suns...

    After the initial <B> has matched, .*? immediately decides that since it doesn’t require any matches, it lazily doesn’t bother trying to perform any. So, it immediately passes control to the following <:

    The < doesn’t match at that point, so control returns back to .*? where it still has its untried option to attempt a match (to attempt multiple matches, actually). It begrudgingly does so, with the dot matching the underlined B in ˙˙˙<B>Billions˙˙˙. Again, the *? has the option to match more, or to stop. It’s lazy, so it first tries stopping. The subsequent < still fails, so .*? has to again exercise its untried match option. After eight cycles, .*? eventually matches Billions, at which point the subsequent < (and the whole </B> subexpression) is finally able to match: 

      ˙˙˙<B>Billions</B> and <B>Zillions</B> of suns˙˙˙

    So, as we’ve seen, the greediness of star and friends can be a real boon at times, while troublesome at others. Having non-greedy, lazy versions is wonderful, as they allow you to do things that are otherwise very difficult (or even impossible). Still, I’ve often seen inexperienced programmers use lazy quantifiers in inappropriate situations. In fact, what we’ve just done may not be appropriate. Consider applying <B>.*? </B> to:

      ˙˙˙<B>Billions and <B>Zillions</B> of suns˙˙˙

    It matches as shown, and while I suppose it depends on the exact needs of the situation, I would think that in this case that match is not desired. However,
    there’s nothing about .*? to stop it from marching right past the Zillion’s <B> to its </B>.

    This is an excellent example of why a lazy quantifier is often not a good replacement for a negated class. In the ".*" example, using [ˆ"] as a replacement for the dot specifically disallows it from marching past a delimiter—a quality we wish our current regex had.

    However, if negative lookahead ( 133) is supported, you can use it to create something comparable to a negated class. Alone, (?!<B>)  is a test that is successful if <B> is not at the current location in the string. Those are the locations that we want the dot of <B>.*?</B>  to match, so changing that dot to
    ((?!<B>) .) creates a regex that matches where we want it, but doesn’t match where we don’t. Assembled all together, the whole thing can become quite confusing, so I’ll show it here in a free-spacing mode ( 111) with comments:

    <B>

    # Match the opening <B>

    (

    # Now, only as many of the following as needed . . .

    (?! <B>

    )

    #

    If not <B> . . .

    .

     

    #

    . . . any character is okay

    )*?

     

    #

     

    </B>

    # . . . until the closing delimiter can match

    With one adjustment to the lookahead, we can put the quantifier back to a normal greedy one, which may be less confusing to some:

    <B>

    # Match the opening <B>

    (

    # Now, as many of the following as possible . . .

    (?! </?B>

    )

    #

    If not <B>, and not </B> . . .

    .

     

    #

    . . . any character is okay

    )+

    # (now greedy)

    </B>

    # <ANNO> . . . until the closing delimiter can match.

    Now, the lookahead prohibits the main body to match beyond </B> as well as <B>, which eliminates the problem we tried to solve with laziness, so the laziness can be removed. This expression can still be improved; we’ll see it again during the discussion on efficiency in Chapter 6 ( 270).

    Greediness and Laziness Always Favor a Match

    Recall the price display example from Chapter 2 ( 51). We’ll examine this example in detail at a number of points during this chapter, so I’ll recap the basic issue: due to floating-point representation problems, values that should have been “1.625” or “3.00” were sometimes coming out like “1.62500000002828” and “3.00000000028822”. To fix this, I used

      $price =˜ s/(\.\d\d[1-9]?)\d*/$1/;

    to lop off all but the first two or three decimal digits from the value stored in the variable $price. The \.\d\d matches the first two decimal digits regardless, while the [1-9]? matches the third digit only if it is non-zero.

    I then noted:

    Anything matched so far is what we want to keep, so we wrap it in parentheses to capture to $1. We can then use $1 in the replacement string. If this is the only thing that matches, we replace exactly what was matched with itself—not very useful. However, we go on to match other items outside the $1 parentheses. They don’t find their way to the replacement string, so the effect is that they’re removed. In this case, the “to be removed” text is any extra digits, the \d* at the end of the regex.

    So far so good, but let’s consider what happens when the contents of the variable $price is already well formed. When it is 27.625, the (\.\d\d[1-9]?) part matches the entire decimal part. Since the trailing  \d* doesn’t match anything, the substitution replaces the '.625' with '.625' — an effective no-op.

    This is the desired result, but wouldn’t it be just a bit more efficient to do the replacement only when it would have some real effect (that is, do the replacement only when \d* actually matches something) ? Well, we know how to write “at least one digit”! Simply replace \d* with \d*:

      $price =˜ s/(\.\d\d[1-9]?)\d+/$1/

    With crazy numbers like “1.62500000002828”, it still works as before, but with something such as “9.43”, the trailing \d+ isn’t able to match, so rightly, no substitution occurs. So, this is a great modication, yes? No! What happens with a three-digit decimal value like 27.625?. We want this value to be left alone, but that’s not what happens. Stop for a moment to work through the match of 27.625 yourself, with particular attention to how the '5' interacts with the regex.

    In hindsight, the problem is really fairly simple. Picking up in the action once (\.\d\d[1-9]?)\d+ has matched 27.625, we find that \d+ can’t match. That’s no problem for the overall match, though, since as far as the regex is concerned, the match of 5’ by [1-9]  was optional and there is still a saved state to try. This state allows [1-9]? to match nothing, leaving the 5 to fulfill the must-match-one requirement of \d+. Thus, we get the match, but not the right match: .625 is replaced by.62, </B>, and the value becomes incorrect.

    What if [1-9]? were lazy instead? We’d get the same match, but without the intervening “match the 5 but then give it back” steps, since the lazy [1-9]?? first skips the match attempt. So, laziness is not a solution to this problem.

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 6 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek