Java
  Home arrow Java arrow Page 3 - NFA, DFA, POSIX, and the Mechanics of Expr...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2007-01-18

    Table of Contents:
  • NFA, DFA, POSIX, and the Mechanics of Expression Processing
  • NFA, DFA, and POSIX
  • Speed and Efficiency
  • DFA versus NFA: Differences in ease of implementation

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    NFA, DFA, POSIX, and the Mechanics of Expression Processing - Speed and Efficiency


    (Page 3 of 4 )

    If efficiency is an issue with a Traditional NFA (and with backtracking, believe me, it can be), it is doubly so with a POSIX NFA since there can be so much more backtracking. A POSIX NFA engine really does have to try every possible permutation of the regex, every time. Examples in Chapter 6 show that poorly written regexes can suffer extremely severe performance penalties.

    DFA efficiency

    The text-directed DFA is a really fantastic way around all the inefficiency of backtracking. It gets its matching speed by keeping track of all possible ongoing matches at once. How does it achieve this magic?

    The DFA engine spends extra time and memory when it first sees the regular expression, before any match attempts are made, to analyze the regular expression more thoroughly (and in a different way) from an NFA. Once it starts actually attempting a match, it has an internal map describing “If I read such-and-such a character now, it will be part of this-and-that possible match.” As each character of the string is checked, the engine simply follows the map.

    Building that map can sometimes take a fair amount of time and memory, but once it is done for any particular regular expression, the results can be applied to an unlimited amount of text. It’s sort of like charging the batteries of your electric car. First, your car sits in the garage for a while, plugged into the wall, but when you actually use it, you get consistent, clean power.


    NFA: Theory Versus Reality

    The true mathematical and computational meaning of “NFA” is different from what is commonly called an “NFA regex engine.” In theory, NFA and DFA engines should match exactly the same text and have exactly the same features. In practice, the desire for richer, more expressive regular expressions has caused their semantics to diverge. An example is the support for backreferences.

    The design of a DFA engine precludes backreferences, but it’s a relatively small task to add backreference support to a true (mathematically speaking) NFA engine. In doing so, you create a more powerful tool, but you also make it decidedly nonregular (mathematically speaking). What does this mean? At most, that you should probably stop calling it an NFA, and start using the phrase “nonregular expressions,” since that describes (mathematically speaking) the new situation. No one has actually done this, so the name “NFA” has lingered, even though the implementation is no longer (mathematically speaking) an NFA.

    What does all this mean to you, as a user? Absolutely nothing. As a user, you don’t care if it’s regular,
    nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter shows you), you know all you need to care about.

    For those wishing to learn more about the theory of regular expressions, the classic computer-science text is chapter 3 of Aho, Sethi, and Ullman’s Compilers— Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called “The Dragon Book” due to the cover design. More specifically, this is the “red dragon.” The “green dragon” is its predecessor, Aho and Ullman’s Principles of Compiler Design.

     

     


     

    The work done when a regex is first seen (the once-per-regex overhead) is called compiling the regex. The map-building is what a DFA does. An NFA also builds an internal representation of the regex, but an NFA’s
    representation is like a mini program that the engine then executes.

    Summary: NFA and DFA in Comparison

    Both DFA and NFA engines have their good and bad points.

    DFA versus NFA: Differences in the pre-use compile

    Before applying a regex to a search, both types of engines compile the regex to an internal form suited to their respective match algorithms. An NFA compile is generally faster, and requires less memory. There’s no real difference between a Traditional and POSIX NFA compile.

    NFA, DEA, and POSIX

    DFA versus NFA: Differences in match speed

    For simple literal-match tests in “normal” situations, both types match at about the same rate. A DFA’s match speed is generally unrelated to the particular regex, but an NFA’s is directly related.

    A Traditional NFA must try every possible permutation of the regex before it can conclude that there’s no match. This is why I spend an entire chapter (Chapter 6) on techniques to write NFA expressions that match quickly. As we’ll see, an NFA match can sometimes take forever. If it’s a Traditional NFA, it can at least stop if and when it finds a match.

    A POSIX NFA, on the other hand, must always try every possible permutation of the regex to ensure that it has found the longest possible match, so it generally takes the same (possibly very long) amount of time to complete a successful match as it does to confirm a failure. Writing efficient expressions is doubly important for a POSIX NFA.

    In one sense, I speak a bit too strongly, since optimizations can often reduce the work needed to return an answer. We’ve already seen that an optimized engine doesn’t try ˆ-anchored regexes beyond the start of the string ( 149), and we’ll see many more optimizations in Chapter 6.

    The need for optimizations is less pressing with a DFA since its matching is so fast to begin with, but for the most part, the extra work done during the DFA’s pre-use compile affords better optimizations than most NFA engines take the trouble to do.

    Modern DFA engines often try to reduce the time and memory used during the compile by postponing some work until a match is attempted. Often, much of the compile-time work goes unused because of the nature of the text actually checked. A fair amount of time and memory can sometimes be saved by postponing the work until it’s actually needed during the match. (The technobabble term for this is lazy evaluation.) It does, however, create cases where there can be a relationship among the regex, the text being checked, and the match speed.

    DFA versus NFA: Differences in what is matched

    A DFA (or anything POSIX) ?nds the longest leftmost match. A Traditional NFA might also, or it might find something else. Any individual engine always treats the same regex/text combination in the same way, so in that sense, it’s not “random,” but other NFA engines may decide to do slightly different things. Virtually all Traditional NFA engines I’ve seen work exactly the way I’ve described here, but it’s not something absolutely guaranteed by any standard.

    DFA versus NFA: Differences in capabilities

    An NFA engine can support many things that a DFA cannot. Among them are:

    1. Capturing text matched by a parenthesized subexpression. Related features are
      backreferences and after-match information saying where in the matched text each parenthesized subexpression matched.
    2. Lookaround, and other complex zero-width assertions( 133).
    3. Non-greedy quantifiers and ordered alternation. A DFA could easily support a guaranteed shortest overall match (although for whatever reason, this option never seems to be made available to the user), but it cannot implement the local laziness and ordered alternation that we’ve talked about.
    4. Possessive quantifiers ( 142) and atomic grouping ( 139).

    DFA Speed with NFA Capabilities: Regex Nirvana?

    I’ve said several times that a DFA can’t provide capturing parentheses or backreferences. This is quite true, but it certainly doesn’t preclude hybrid
    approaches that mix technologies in an attempt to reach regex nirvana. The sidebar on page 180 told how NFAs have diverged from the theoretical straight and narrow in search of more power, and it’s only natural that the same happens with DFAs. A DFA’s construction makes it more difficult, but that doesn’t mean impossible.

    GNU grep takes a simple but effective approach. It uses a DFA when possible, reverting to an NFA when backreferences are used. GNU awk does something similar—it uses GNU grep’s fast shortest-leftmost DFA engine for simple “does it match” checks, and reverts to a different engine for checks where the actual extent of the match must be known. Since that other engine is an NFA, GNU awk can conveniently offer capturing parentheses, and it does via its special gensub function.

    Tcl’s regex engine is a true hybrid, custom built by Henry Spencer (whom you may remember having played an important part in the early development and popularization of regular expressions  88). The Tcl engine sometimes appears to be an NFA—it has lookaround, capturing parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX longest-leftmost match ( 177), and doesn’t suffer from some of the NFA problems that we’ll see in Chapter 6. It really seems quite wonderful.

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 5 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek