Java
  Home arrow Java arrow Greediness, Backtracking, and Expression P...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Greediness, Backtracking, and Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 3 stars3 stars3 stars3 stars3 stars / 2
    2007-01-11

    Table of Contents:
  • Greediness, Backtracking, and Expression Processing
  • Using Lazy Quantifiers
  • The Essence of Greediness, Laziness, and Backtracking
  • The essence of atomic grouping

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Greediness, Backtracking, and Expression Processing


    (Page 1 of 4 )

    This article, the third of a four-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

    More About Greediness and Backtracking

    Many concerns (and benefits) of greediness are
    shared by both an NFA and a DFA. (A DFA doesn’t support laziness, which is why we’ve concentrated on greediness up to this point.) I’d like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there’s not much


    Quiz Answer

    Answer to the question on page 162.

    When matching [0-9] * against  ‘a 1234 num', would 'a 1234 num' be part of a saved state?

    The answer is “no.” I posed this question because the mistake is commonly made. Remember, a component that has star applied can always match. If that’s the entire regex, it can always match anywhere. This certainly includes the attempt when the transmission applies the engine the first time, at the start of the string. In this case, the regex matches at ‘a 1234 num’ and that’s the end of it—it never even gets as far the digits.

    In case you missed this, there’s still a chance for partial credit. Had there been something in the regex after the [0-9] * that kept an overall match from happening before the engine got to:

    then indeed, the attempt of the ‘1’ also creates the state:

     


    more to say after that. It’s very easy to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed
    nature provides. Besides lazy quantifiers, there are a variety of extra features an NFA can support, including lookaround, conditionals, backreferences, and atomic grouping. And on top of these, an NFA affords the regex author direct control over how a match is carried out, which can be a benefit when used properly, but it does create some efficiency-related pitfalls (discussed in Chapter 6.)

     

    Despite these differences, the match results are often similar. For the next few pages, I’ll talk of both engine types, but describe effects in terms of the regex-
    directed NFA. By the end of this chapter, you’ll have a firm grasp of just when the results might differ, as well as exactly why.

    Problems of Greediness

    As we saw with the last example, .* always marches to the end of the line.This is because .* just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.

    Sometimes this can be a real pain. Consider a regex to match text wrapped in double quotes. At first, you might want to write ".*", but knowing what we know about .*, guess where it matches in:

      The name "McDonald’s" is said "makudonarudo" in Japanese

    Actually, since we understand the mechanics of matching, we don’t need to guess, because we know. Once the initial quote matches, .* is free to match, and immediately does so all the way to the end of the string. It backs off (or, perhaps more appropriately, is backed off by the regex engine) only as much as is needed until the final quote can match. In the end, it matches

      The name "McDonald’s" is said "makudonarudo" in Japanese

    which is obviously not the double-quoted string that was intended. This is one reason why I caution against the overuse of .*, as it can often lead to surprising results if you don’t pay careful attention to greediness.

    So, how can we have it match "McDonald’s" only? The key is to realize that we don’t want “anything” between the quotes, but rather “anything except a quote.” If we use [ˆ"] * rather than .*, it won’t overshoot the closing quote.

    The regex engine’s basic approach with " [ˆ"] *" is exactly the same as before. Once the initial double quote matches, [ˆ"] * gets a shot at matching as much as it can. In this case, that’s up to the double quote after McDonald’s, at which point it finally stops because [ˆ"] can’t match the quote. At that point, control moves to the closing ". It happily matches, resulting in overall success:

      The name "McDonald’s" is said "makudonarudo" in Japanese

    Actually, there could be one unexpected change, and that’s because in most flavors, [ˆ"] can match a newline, while dot doesn’t. If you want to keep the regex from crossing lines, use [ˆ"\n] .

    Multi-Character “Quotes”

    In the first chapter, I talked a bit about matching HTML tags, such as the sequence <B>very</B> that renders the “very” in bold if the browser can do so. Attempting to match a <B>...</B> sequence seems similar to matching a quoted string, except the “quotes” in this case are the multi-character sequences <B> and </B>. Like the quoted string example, multiple sets of “quotes” cause problems if we use .*:

      ˙˙˙<B>Billions</B> and <B>Zillions</B> of suns...

    With <B> .* </B>, the greedy .* causes the match in progress to zip to the end of the line, backtracking only far enough to allow the </B> to match, matching the last </B> on the line instead of the one corresponding to the opening <B> at the start of the match.

    Unfortunately, since the closing delimiter is more than one character, we can’t solve the problem with a negated class as we did with double-quoted strings. We can’t expect something like <B>[ˆ</B>]*</B> to work. A character class represents only one character and not the full </B> sequence that we want. Don’t let the apparent structure of [ˆ</B>] fool you. It is just a class to match one character—any one except <, >/, and B. It is the same as, say [^/<>B], and certainly  doesn’t work as an “anything not </B>"  construct. (With lookahead, you can insist that </B> not match at a particular point; we’ll see this in action in the next section.)

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...






    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 1 hosted by Hostway
    Stay green...Green IT