Java
  Home arrow Java arrow Page 4 - NFA, DFA, POSIX, and the Mechanics of Expr...
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2007-01-18

    Table of Contents:
  • NFA, DFA, POSIX, and the Mechanics of Expression Processing
  • NFA, DFA, and POSIX
  • Speed and Efficiency
  • DFA versus NFA: Differences in ease of implementation

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    NFA, DFA, POSIX, and the Mechanics of Expression Processing - DFA versus NFA: Differences in ease of implementation


    (Page 4 of 4 )

     

    Although they have limitations, simple versions of DFA and NFA engines are easy enough to understand and to implement. The desire for efficiency (both in time and memory) and enhanced features drives the implementation to greater and greater complexity.

    With code length as a metric, consider that the NFA regex support in the Version 7 ( January 1979) edition of ed was less than 350 lines of C code. (For that matter, the entire source for grep was a scant 478 lines.) Henry Spencer’s 1986 freely available implementation of the Version 8 regex routines was almost 1,900 lines of C, and Tom Lord’s 1992 POSIX NFA package rx (used in GNU sed, among other tools) is a stunning 9,700 lines.

    For DFA implementations, the Version 7 egrep regex engine was a bit over 400 lines long, while Henry Spencer’s 1992 full-featured POSIX DFA package is over 4,500 lines long.

    To provide the best of both worlds, GNU egrep Version 2.4.2 utilizes two fully functional engines (about 8,900 lines of code), and Tcl’s hybrid DFA/NFA engine (see the sidebar on the facing page) is about 9,500 lines of code.

    Some implementations are simple, but that doesn’t necessarily mean they are short on features. I once wanted to use regular expressions for some text processing in Pascal. I hadn’t used Pascal since college, but it still didn’t take long to write a simple NFA regex engine. It didn’t have a lot of bells and whistles, and wasn’t built for maximum speed, but the flavor was relatively full-featured and was quite useful.

    Summary

    If you understood everything in this chapter the first time you read it, you probably didn’t need to read it in the first place. It’s heady stuff, to say the least. It took me quite a while to understand it, and then longer still to understand it. I hope this one concise presentation makes it easier for you. I’ve tried to keep the explanation simple without falling into the trap of oversimplication (an unfortunately all-too-common occurrence which hinders true understanding). This chapter has a lot in it, so I’ve included a lot of page references in the following summary, for when you’d like to quickly check back on something.

    There are two underlying technologies commonly used to implement a regex match engine, “regex-directed NFA”( 153) and “text-directed DFA”( 155). The
    abbreviations are spelled out on page 156.

    Combine the two technologies with the POSIX standard ( 178), and for practical purposes, there are three types of engines:

    1. Traditional NFA          (gas-guzzling, power-on-
                                        demand)
    2. POSIX NFA                 (gas-guzzling, standard-
                                        compliant)
    3. DFA (POSIX or not)    (electric, steady-as-she-
                                        goes)

    To get the most out of a utility, you need to understand which type of engine it uses, and craft your regular expressions appropriately. The most common type is the Traditional NFA, followed by the DFA. Table 4-1 ( 145) lists a few common tools and their engine types, and the section “Testing the Engine Type”  ( 146) shows how you can test the type yourself.

    One overriding rule regardless of engine type: matches starting sooner take precedence over matches starting later. This is due to how the engine’s “transmission” tests the regex at each point in the string ( 148).

    For the match attempt starting at any given spot:

    DFA Text-Directed Engines

    Find the longest possible match, period. That’s it. End of discussion ( 177). Consistent, very fast
    ( 179), and boring to talk about.

    NFA Regex-Directed Engines

    Must “work through” a match. The soul of NFA matching is backtracking ( 157, 162). The metacharacters control the match: the standard quantifiers (star and friends) are greedy ( 151), while others may be lazy or possessive ( 169). Alternation is ordered ( 174) in a traditional NFA, but greedy with a POSIX NFA.

    POSIX NFA   Must find the longest match, period. But, it’s not boring, as you must worry about
    efficiency (the subject of Chapter 6).

    Traditional NFA   Is the most expressive type of regex engine, since you can use the regex-
    directed nature of the engine to craft exactly the match you want.

    Understanding the concepts and practices covered in this chapter is the foundation for writing correct and efficient regular expressions, which just happens to be the subject of the next two chapters.


    † California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in “California” and “non-California” models.

    † Actually, as we saw in the previous chapter (?128), a POSIX collating sequence can match multiple characters, but this is not common. Also, certain Unicode characters can match multiple characters when applied in a case-insensitive manner (?110), although most implementations do not support this.

    † This does not mean that there can’t be some mixing of technologies to try to get the best of both worlds. This is discussed in a sidebar on page 182.

    † This example uses capturing as a forum for presenting greediness, so the example itself is appropriate only for NFAs (because only NFAs support capturing). The lessons on greediness, however, apply to all engines, including the non-capturing DFA.

    † I suppose I could explain the underlying theory that goes into these names, if I only knew it! As I hinted, the word deterministic is pretty important, but for the most part the theory is not relevant, so long as we understand the practical effects. By the end of this chapter, we will.

    † Just for comparison, remember that a DFA doesn’t care much about the form you use to express which matches are possible; the three examples are identical to a DFA.

    † With a tool or mode where a dot can match a newline, .+" applied to strings that contain multiline data matches through all the logical lines to the end of the whole string.

    † A smart implementation may be able to make the possessive version a bit more efficient than its atomic-grouping counterpart (?250).

    lex has trailing context, which is exactly the same thing as zero-width positive lookahead at the end of the regex, but it can’t be generalized and put to use for embedded lookahead.


    DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.

       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 2 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek