Java
  Home arrow Java arrow Page 2 - Two Engines for Expression Processing
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Two Engines for Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2007-01-04

    Table of Contents:
  • Two Engines for Expression Processing
  • Backtracking
  • A match without backtracking
  • Backtracking and Greediness

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Two Engines for Expression Processing - Backtracking


    (Page 2 of 4 )

    The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be.

    Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alternative to try, and which to leave for later).

    Whichever course of action is attempted, if it’s successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).

    A Really Crummy Analogy

    Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path you choose turns out to be a dead end, you can retrace your steps, giving up ground until you come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, you can backtrack further, retracing your steps to the next pile of crumbs, and so on, until you eventually find a path that leads to your goal, or until you run out of untried paths.

    There are various situations when the regex engine needs to choose between two (or more) options—the alternation we saw earlier is only one example. Another example is that upon reaching ...x?..., the engine must decide whether it should attempt x. Upon reaching  ...x+..., however, there is no question about trying to match x at least once—the plus
    requires at least one match, and that’s non-
    negotiable. Once the first x has been matched, though, the requirement is lifted and it then must decide to match another x.If it decides to match, it must decide if it will then attempt to match yet another... and another... and so on. At each of these many decision points, a virtual “pile of crumbs” is left behind as a reminder that another option (to match or not to match, whichever wasn’t chosen at each point) remains viable at that point.

    A crummy little example

    Let’s look at a full example using our earlier to(nite|knight|night) regex on the string 'hot tonic tonight!' (silly, yes, but good example).  The first component t, is attempted at the start of the string. It fails to match h, so the entire regex fails at that point. The engine’s transmission then bumps along to retry the regex from the second position (which also fails), and again at the third. This time the t matches, but the subsequent o fails to match because the text we’re at is now a space. So, again, the whole attempt fails.

    The attempt that eventually starts at ˙˙˙tonic˙˙˙ is more interesting. Once the to has been matched, the three alternatives become three available options. The regex engine picks one to try, remembering the others (“leaving some bread crumbs”) in case the first fails. For the purposes of discussion, let’s say that the engine first chooses nite. That expression breaks down to “n + i + ...,” which gets to ...tonic... before failing. Unlike the earlier failures, this failure doesn’t mean the end of the overall attempt because other options—the as-of-yet untried alternatives—still remain. (In our analogy, we still have piles of breadcrumbs we can return to.) The engine chooses one, we’ll say  knight, but it fails right away because k doesn’t match ‘n’. That leaves one final option, night, but it too eventually fails. Since that was the final untried option, its failure means the failure of the entire attempt starting at ...tonic..., so the transmission kicks in again.

    been matched, the three alternatives become three available options. The regex engine picks one to try, remembering the others (“leaving some bread crumbs”) in case the first fails. For the purposes of discussion, let’s say that the engine first chooses . That expression breaks down to “ ,” which gets to before failing. Unlike the earlier failures, this failure doesn’t mean the end of the overall attempt because other options—the as-of-yet untried alternatives—still remain. (In our analogy, we still have piles of breadcrumbs we can return to.) The engine chooses one, we’ll say  , but it fails right away because doesn’t match ‘’. That leaves one final option, , but it too eventually fails. Since that was the final untried option, its failure means the failure of the entire attempt starting at

    Once the engine works its way to attempt the match starting at ...tonight!, it gets interesting again. This time, the night alternative successfully matches to the end (which means an overall match, so the engine can report success at that point).

    Two Important Points on Backtracking

    The general idea of how backtracking works is fairly simple, but some of the details are quite important for real-world use. Specifically, when faced with multiple choices, which choice should be tried first? Secondly, when forced to backtrack, which saved choice should the engine use? The answer to that first question is this important principle:

    In situations where the decision is between “make an attempt” and “skip an attempt,” as with items governed by quantifiers, the engine always chooses to first make the attempt for
    greedy quantifiers, and to first skip the attempt for lazy (non-greedy) ones.

    This has far-reaching repercussions. For starters, it helps explain why the greedy quantifiers are greedy, but it doesn’t explain it completely. To complete the picture, we need to know which (among possibly many) saved options to use when we backtrack. Simply put:

    The most recently saved option is the one returned to when a local failure forces backtracking. They’re used LIFO (last in first out).

    This is easily understood in the crummy analogy—if your path becomes blocked, you simply retrace your steps until you come back across a pile of bread crumbs. The first you’ll return to is the one most recently laid. The traditional analogy for describing LIFO also holds: like stacking and unstacking dishes, the most-recently stacked will be the first unstacked.

    Saved States

    In NFA regular expression nomenclature, the piles of bread crumbs are known as saved states. A state indicates where matching can restart from, if need be. It reflects both the position in the regex and the point in the string where an untried option begins. Because this is the basis for NFA matching, let me show the implications of what I’ve already said with some simple but verbose examples. If you’re comfortable with the discussion so far, feel free to skip ahead.

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 6 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek