Java
  Home arrow Java arrow NFA, DFA, POSIX, and the Mechanics of Expr...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Dedicated Servers  
Moblin 
JMSL Numerical Library 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2007-01-18

    Table of Contents:
  • NFA, DFA, POSIX, and the Mechanics of Expression Processing
  • NFA, DFA, and POSIX
  • Speed and Efficiency
  • DFA versus NFA: Differences in ease of implementation

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    NFA, DFA, POSIX, and the Mechanics of Expression Processing


    (Page 1 of 4 )

    This article, the last of a multi-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

    Mimicking atomic grouping with positive lookahead

    It’s perhaps mostly academic for flavors that support atomic grouping, but can be quite useful for those that don’t: if you have positive lookahead, and if it supports capturing parentheses within the lookahead (most flavors do, but Tcl’s lookahead, for example, does not), you can mimic atomic grouping and possessive
    quantifiers. (?>regex) can be mimicked with
    (?=(
    regex))\1. For example, compare ˆ(?>\w+): with ˆ(?=(\w+))\1:. 

    The lookahead version has \w+ greedily match as much as it can, capturing an entire word. Because it’s within lookahead, the intermediate states are thrown away when it’s finished (just as if, incidentally, it had been within atomic grouping). Unlike atomic grouping, the matched word is not included as part of the match (that’s the whole point of lookahead), but the word does remain captured. That’s a key point because it means that when \1 is applied, it’s actually being applied to the very text that filled it, and it’s certain to succeed. This extra step of applying \1 is simply to move the regex past the matched word.

    This technique is a bit less efficient than real atomic grouping because of the extra time required to rematch the text via \1. But,  since states are thrown away, it fails more quickly than a raw \w+: when the : can’t match.

    Is Alternation Greedy?

    How alternation works is an important point because it can work in fundamentally different ways with
    different regex engines. When alternation is reached, any number of the alternatives might be able to match at that point, but which will? Put another way, if more than one can match, which will? If it’s always the one that matches the most text, one might say that alternation is greedy. If it’s always the shortest amount of text, one might say it’s lazy? Which (if either) is it?

    Let’s look at the Traditional NFA engine used in Perl, PHP, Java, .NET languages, and many others ( 145). When faced with alternation, each alternative is checked in the left-to-right order given in the expression. With the example regex of ˆ(Subject|Date) : ,  when the Subject|Date 
    alternation is reached, the first alternative,  Subject, is attempted. If it matches, the rest of the regex (the subsequent:) is given a chance. If it turns out that it can’t match, and if other alternatives remain (in this case, Date), the regex engine backtracks to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, all alternatives) are exhausted.

    More About Greediness and Backtracking

    So, with that common Traditional NFA engine, what text is actually matched by tour|to|tournament when applied to the string ‘three tournaments won’? All the alternatives are attempted (and fail) during attempts starting at each character position until the transmission starts the attempt at ‘three tournaments won’. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

    So, we see that alternation is neither greedy nor lazy, but ordered, at least for a Traditional NFA. This is more powerful than greedy alternation because it allows
    more control over just how a match is attempted—it allows the regex author to express “try this, then that, and finally try that, until you get a match.”

    Not all flavors have ordered alternation. DFAs and POSIX NFAs do have greedy alternation, always matching with the alternative that matches the most text (tournament in this case). But, if you’re using Perl, PHP, a .NET language, java.util.regex, or any other system with a Traditional NFA engine (list  145), your alternation is ordered.

    Taking Advantage of Ordered Alternation

    Let’s revisit the (\.\d\d[1-9]?)\d* example from page 167. If we realize that \.\d\d[1-9]?, in effect, says “allow either \.\d\d or \.\d\d[1-9] ", we can rewrite the entire expression as (\.\d\d|\.\d\d[1-9])\d*". (There is no compelling reason to make this change—it’s merely a handy example.) Is this really the same as the original? If alternation is truly greedy, then it is, but the two are quite different with ordered alternation.

    Let’s consider it as ordered for the moment. The first alternative is selected and tested, and if it matches, control passes to the \d* that follows the alternation. If there are digits remaining, the \d* matches them, including any initial non-zero digit that was the root of the original example’s problem (if you’ll recall the original problem, that’s a digit we want to match only within the parentheses, not by the \d* after the parentheses). Also, realize that if the first alternative can’t match, the second alternative will certainly not be able to, as it begins with a copy of the entire first alternative. If the first alternative doesn’t match, though, the regex engine nevertheless expends the effort for the futile attempt of the second.

    Interestingly, if we swap the alternatives and use (\.\d\d[1-9]| \.\d\d)\d*, we do effectively get a replica of the original greedy (\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the trailing [1-9], the second alternative still stands a chance. It’s still ordered alternation, but now we’ve selected the order to result in a greedy-type match.

    When first distributing the [1-9]? to two alternatives, in placing the shorter one first, we fashioned a non-greedy ? of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake when used with a Traditional NFA. In one book I’ve read, a*((ab)*|b*) is used as an example in explaining something about Traditional NFA regex parentheses. It’s a pointless example because the first alternative,  (ab)*, can never fail, so any other alternatives are utterly meaningless. You could add

      a*((ab)*|b*|.*|partridge in a pear tree;[a-z])

    and it wouldn’t change the meaning a bit. The moral is that with ordered alternation, when more than one alternative can potentially match the same text, care must be taken when selecting the order of the alternatives.

    Ordered alternation pitfalls

    Ordered alternation can be put to your advantage by allowing you to craft just the match you want, but it can also lead to unexpected pitfalls for the unaware. Consider matching a January date of the form ‘Jan 31’. We need something more sophisticated than, say, Jan.[0123][0-9], as that allows "dates" as 'Jan.00''Jan.39', and disallows, 'Jan.7'.

    One way to match the date part is to attack it in sections. To match from the first through the ninth, using 0?[1-9] allows a leading zero. Adding [12][0-9] allows for the tenth through the 29th, and 3[01]  rounds it out. Putting it all together, we get Jan.(0?[1-9]|[12][0-9]|3[01]).

    Where do you think this matches in ‘Jan 31 is Dad’s birthday’? We want it to match ‘Jan 31’, of course, but ordered alternation actually matches only ‘Jan 3’. Surprised? During the match of the first alternative, 0?[1-9], the leading 0? fails, but the alternative matches because the subsequent  [1-9] has no trouble matching the 3. Since that’s the end of the expression, the match is complete.

    When the order of the alternatives is adjusted so that the alternative that can potentially match a shorter amount of text is placed last, the problem goes away. This works: Jan.([12][0-9]|3[01]|0?[1-9]). 

    Another approach is Jan.(31|[123]0|[012]?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet, a third approach is Jan.(0[1-9]|[12][0-9]?|3[01]?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I’ll leave for your free time, although the sidebar on the facing page should be helpful).

    NFA, DFA, and POSIX 


    A Few Ways to Slice and Dice a Date

    A few approaches to the date-matching problem posed on page 176. The calendar associated with each regex shows what can be matched by each alternative color-coded within the regex.


    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway