Java
  Home arrow Java arrow Page 2 - NFA, DFA, POSIX, and the Mechanics of Expr...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2007-01-18

    Table of Contents:
  • NFA, DFA, POSIX, and the Mechanics of Expression Processing
  • NFA, DFA, and POSIX
  • Speed and Efficiency
  • DFA versus NFA: Differences in ease of implementation

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    NFA, DFA, POSIX, and the Mechanics of Expression Processing - NFA, DFA, and POSIX


    (Page 2 of 4 )

    "The Longest-Leftmost"

    Let me repeat what I’ve said before: when the transmission starts a DFA engine from some particular point in the string, and there exists a match or matches to be found at that position, the DFA finds the longest possible match, period. Since it’s the longest from among all possible matches that start equally furthest to the left, it’s the “longest-leftmost” match.

    Really, the longest

    Issues of which match is longest aren’t confined to alternation. Consider how an NFA matches the (horribly contrived) one(self)?(selfsufficient)?  against the string oneselfsufficient. An NFA first matches  one and then the greedy (self)?,  leaving (selfsufficient)? left to try against sufficient. It doesn’t match, but that’s OK since it is optional. So, the Traditional NFA returns oneselfsufficient and discards the untried states. (A POSIX NFA is another story that we’ll get to shortly.)

    On the other hand, a DFA finds the longer oneselfsufficient. An NFA would also find that match if the initial (self)? were to somehow go unmatched, as that would leave (selfsufficient)? then able to match. A Traditional NFA doesn’t do that, but the DFA finds it nevertheless, since it’s the longest possible match available to the current attempt. It can do this because it keeps track of all matches simultaneously, and knows at all times about all possible matches.

    I chose this silly example because it’s easy to talk about, but I want you to realize that this issue is important in real life. For example, consider trying to match continuation lines. It’s not uncommon for a data specification to allow one logical line to extend across multiple real lines if the real lines end with a backslash before the newline. As an example, consider the following:

      SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \
              missing.c msg.c node.c re.c version.c

    You might normally want to use ˆ\w+=.* to match this kind of “var = value” assignment line, but this regex doesn’t consider the continuation lines. (I’m assuming for the example that the tool’s dot won’t match a newline.) To match continuation lines, you might consider appending (\\\n.*)* to the regex, yielding ˆ\w+=.*(\\\n.*)*. Ostensibly, this says that any number of additional logical lines are allowed so long as they each follow an escaped newline. This seems reasonable, but it will never work with a traditional NFA. By the time the original .* has reached the newline, it has already passed the backslash, and nothing in what was added forces it to backtrack
    ( 152). Yet, a DFA finds the longer multiline match if available, simply because it is, indeed, the longest.

    If you have lazy quantifiers available, you might consider using them, such as with ˆ\w+=.*?(\\\n.*?)*.  This allows the escaped newline part to be tested each time before the first dot actually matches anything, so the thought is that the \\ then gets to match the backslash before the newline. Again, this won’t work. A lazy quantifier actually ends up matching something optional only when forced to do so, but in this case, everything after the is optional, so there’s nothing to force the lazy quantifiers to match anything. Our lazy example matches only ‘SRC=',  so it’s certainly not the answer.

    There are other approaches to solving this problem; we’ll continue with this example in the next chapter ( 186).

    POSIX and the Longest-Leftmost Rule

    The POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text must be the one returned.

    The POSIX standard document uses the phrase “longest of the leftmost.” It doesn’t say you have to use a DFA, so if you want to use an NFA when creating a POSIX tool, what’s a programmer to do? If you want to implement a POSIX NFA, you’d have to find the full oneselfsufficient and all the continuation lines, despite these results being “unnatural” to your NFA.

    A Traditional NFA engine stops with the first match it finds, but what if it were to continue to try options (states) that might remain? Each time it reached the end of the regex, it would have another plausible match. By the time all options are exhausted, it could simply report the longest of the plausible matches it had found. Thus, a POSIX NFA.

    An NFA applied to the first example would, in matching (self)?, have saved an option noting that it could pick up matching one(self)?(selfsufficient)? at
    oneselfsufficient. Even after finding the oneselfsufficient that a Traditional NFA stops at, a POSIX NFA continues to exhaustively check the remaining options, eventually realizing that yes, there is a way to match the longer (and in fact, longest) oneselfsufficient.

    In Chapter 7, we’ll see a method to trick Perl into mimicking POSIX semantics, having it report the longest match ( 335).

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 6 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek