Java
  Home arrow Java arrow Page 4 - Greediness, Backtracking, and Expression P...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Greediness, Backtracking, and Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 3 stars3 stars3 stars3 stars3 stars / 2
    2007-01-11

    Table of Contents:
  • Greediness, Backtracking, and Expression Processing
  • Using Lazy Quantifiers
  • The Essence of Greediness, Laziness, and Backtracking
  • The essence of atomic grouping

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Greediness, Backtracking, and Expression Processing - The essence of atomic grouping


    (Page 4 of 4 )

    The section “The Essence of Greediness, Laziness, and Backtracking,” starting on page 168, makes the important point that neither greediness nor laziness influence which paths can be checked, but merely the order in which they are checked. If no match is found, whether by a greedy or a lazy ordering, in the end, every possible path will have been checked.

    Atomic grouping, on the other hand, is fundamentally different because it actually eliminates possible paths. Eliminating states can have a number of different consequences, depending on the situation:

    • No Effect  If a match is reached before one of the eliminated states would have been called upon, there is no effect on the match. We saw this a moment ago with the '.625000'  example. A match was found before the eliminated state would have come into play.
    • Prohibit Match  The elimination of states can mean that a match that would have otherwise been possible now becomes impossible. We saw this with the '.625' example.
    • Different Match  In some cases, it’s possible to get a different match due to the elimination of states.
    • Faster Failure  It’s possible for the elimination of states to do nothing more than allow the regex engine, when no match is to be found, report that fact more quickly. This is discussed right after the quiz.

    Here’s a little quiz: what does the construct
    (?>.*?) do? What kind of things do you expect it can match?  Turn the page to check your answer.

    Some states may remain.  When the engine exits atomic grouping during a match, only states that had been created while inside the atomic grouping are eliminated. States that might have been there before still remain after, so the entire text matched by the atomic subexpression may be unmatched, as a whole, if backtracking later reverts to one of those previous states.

    Faster failures with atomic grouping.  Consider  ˆ\w+: applied to 'Subject'. We can see, just by looking at it, that it will fail because the text doesn’t have a colon in it, but the regex engine won’t reach that conclusion until it actually goes through the motions of checking.

    So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states—one “skip me” state for each match of \w  by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state:

    at which point the : fails again, this time trying to match 't'. This backtrack-test-fail cycle happens all the way back to the oldest state:

    After the attempt from the final state fails, overall failure can finally be announced.


    Quiz Answer

    Answer to the question on page 171.

    What does (?>.*?)  match?

    It can never match, anything. At best, it’s a fairly complex way to accomplish nothing! *? is the lazy *, and governs a dot, so the first path it attempts is the skip-the-dot path, saving the try-the-dot state for later, if required. But the moment that state has been saved, it’s thrown away because matching exits the atomic grouping, so the skip-the-dot path is the only one ever taken. If something is always skipped, it’s as if it’s not there at all.


    All that backtracking is a lot of work that after just a glance we know to be unnecessary. If the colon can’t match after the last letter, it certainly can’t match one of the letters the + is forced to give up!

    So, knowing that none of the states left by \w+, once it’s finished, could possibly lead to a match, we can save the regex engine the trouble of checking them:
    ˆ(?>\w+: By adding the atomic grouping, we use our global knowledge of the regex to enhance the local working of \w+ by having its saved states (which we know to be useless) thrown away. If there is a match, the atomic grouping won’t have mattered, but if there’s not to be a match, having thrown away the useless states lets the regex come to that conclusion more quickly. (An advanced implementation may be able to apply this optimization for you automatically 251.)

    As we’ll see in the Chapter 6 ( 269), this technique shows a very valuable use of atomic grouping, and I suspect it will become the most common use as well.

    Possessive Quantifiers, ?+, *+, ++, and {m,n}+

    Possessive quantifiers are much like greedy
    quantifiers, but they never give up a partial amount of what they’ve been able to match. Once a plus, for example, finishes its run, it has created quite a few saved states, as we saw with the ˆ\w+  example. A possessive plus simply throws those states away (or, more likely, doesn’t bother creating them in the first place).

    As you might guess, possessive quantifiers are closely related to atomic grouping. Something possessive like \w++ appears to match in the same way as (?> \w+),  one is just a notational convenience for the other.With possessive quantifiers, ˆ(?>\w+): can be rewritten as ˆ\w++:, and (\.\d\d(?> [1-9]?))\d+ can be rewritten as (\.\d\d[1-9]?+)\d+

    Be sure to understand the difference between (?>M) + and (?>M+). The first one throws away unused states created by M, which is not very useful since doesn’t create any states. The second one throws away unused states created by M+, which certainly can be useful.

    When phrased as a comparison between (?>M)+ and (?M+ ), it's perhaps clear that the second one is the one comparable to M++, but when converting something more complex like ("|[^"])*+  from possessive quantifiers to atomic grouping, it’s tempting to just add '?>' to the parentheses that are already there: (?>"|[ˆ"])*. The new expression might happen to achieve your goal, but be clear that is not comparable to the original possessive-quantifier version; it’s like changing  M++  to (?>M)+. Rather, to be comparable, remove the possessive plus,
    and then wrap what remains in atomic grouping: (?>("|[^"])*).

    The Backtracking of Lookaround

    It might not be apparent at first, but lookaround (introduced in Chapter 2  59) is closely related to atomic grouping and possessive quantifiers. There are four types of lookaround: positive and negative flavors of lookahead and lookbehind. They simply test whether their subexpression can and can’t match starting at the current location (lookahead), or ending at the current location (lookbehind).

    Looking a bit deeper, how does lookaround work in our NFA world of saved states and backtracking? As a subexpression within one of the lookaround constructs is being tested, it’s as if it’s in its own little world. It saves states as needed, and backtracks as necessary. If the entire subexpression is able to match successfully, what happens? With positive lookaround, the construct, as a whole, is considered a success, and with negative lookaround, it’s considered a failure. In either case, since the only concern is whether there’s a match (and we just found out that, yes, there’s a match), the “little world” of the match attempt, including any saved states that might have been left over from that attempt, is thrown away.

    What about when the subexpression within the lookaround can’t match? Since it’s being applied in its “own little world,” only states created within the current lookaround construct are available. That is, if the regex finds that it needs to backtrack further, beyond where the lookaround construct started, it’s found that the current subexpression can not match. For positive lookahead, this means failure, while for negative lookahead, it means success. In either case, there are no saved states left over (had there been, the subexpression match would not have finished), so there’s no “little world” left to throw away.

    So, we’ve seen that in all cases, once the lookaround construct has finished, there are no saved states left over from its application. Any states that might have been left over, such as in the case of successful positive lookahead, are thrown away.

    Well, where else have we seen states being thrown away? With atomic grouping and possessive quantifiers, of course.

    Please check back next week for the conclusion of this article.


    DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.

       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 5 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek