Home arrow Java arrow Page 4 - NFA, DFA, POSIX, and the Mechanics of Expression Processing
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing


This article, the last of a multi-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 5 stars5 stars5 stars5 stars5 stars / 1
January 18, 2007
TABLE OF CONTENTS:
  1. · NFA, DFA, POSIX, and the Mechanics of Expression Processing
  2. · NFA, DFA, and POSIX
  3. · Speed and Efficiency
  4. · DFA versus NFA: Differences in ease of implementation

print this article
SEARCH DEVARTICLES

TOOLS YOU CAN USE

advertisement
NFA, DFA, POSIX, and the Mechanics of Expression Processing - DFA versus NFA: Differences in ease of implementation
(Page 4 of 4 )

 

Although they have limitations, simple versions of DFA and NFA engines are easy enough to understand and to implement. The desire for efficiency (both in time and memory) and enhanced features drives the implementation to greater and greater complexity.

With code length as a metric, consider that the NFA regex support in the Version 7 ( January 1979) edition of ed was less than 350 lines of C code. (For that matter, the entire source for grep was a scant 478 lines.) Henry Spencer’s 1986 freely available implementation of the Version 8 regex routines was almost 1,900 lines of C, and Tom Lord’s 1992 POSIX NFA package rx (used in GNU sed, among other tools) is a stunning 9,700 lines.

For DFA implementations, the Version 7 egrep regex engine was a bit over 400 lines long, while Henry Spencer’s 1992 full-featured POSIX DFA package is over 4,500 lines long.

To provide the best of both worlds, GNU egrep Version 2.4.2 utilizes two fully functional engines (about 8,900 lines of code), and Tcl’s hybrid DFA/NFA engine (see the sidebar on the facing page) is about 9,500 lines of code.

Some implementations are simple, but that doesn’t necessarily mean they are short on features. I once wanted to use regular expressions for some text processing in Pascal. I hadn’t used Pascal since college, but it still didn’t take long to write a simple NFA regex engine. It didn’t have a lot of bells and whistles, and wasn’t built for maximum speed, but the flavor was relatively full-featured and was quite useful.

Summary

If you understood everything in this chapter the first time you read it, you probably didn’t need to read it in the first place. It’s heady stuff, to say the least. It took me quite a while to understand it, and then longer still to understand it. I hope this one concise presentation makes it easier for you. I’ve tried to keep the explanation simple without falling into the trap of oversimplication (an unfortunately all-too-common occurrence which hinders true understanding). This chapter has a lot in it, so I’ve included a lot of page references in the following summary, for when you’d like to quickly check back on something.

There are two underlying technologies commonly used to implement a regex match engine, “regex-directed NFA”( 153) and “text-directed DFA”( 155). The
abbreviations are spelled out on page 156.

Combine the two technologies with the POSIX standard ( 178), and for practical purposes, there are three types of engines:

  1. Traditional NFA          (gas-guzzling, power-on-
                                      demand)
  2. POSIX NFA                 (gas-guzzling, standard-
                                      compliant)
  3. DFA (POSIX or not)    (electric, steady-as-she-
                                      goes)

To get the most out of a utility, you need to understand which type of engine it uses, and craft your regular expressions appropriately. The most common type is the Traditional NFA, followed by the DFA. Table 4-1 ( 145) lists a few common tools and their engine types, and the section “Testing the Engine Type”  ( 146) shows how you can test the type yourself.

One overriding rule regardless of engine type: matches starting sooner take precedence over matches starting later. This is due to how the engine’s “transmission” tests the regex at each point in the string ( 148).

For the match attempt starting at any given spot:

DFA Text-Directed Engines

Find the longest possible match, period. That’s it. End of discussion ( 177). Consistent, very fast
( 179), and boring to talk about.

NFA Regex-Directed Engines

Must “work through” a match. The soul of NFA matching is backtracking ( 157, 162). The metacharacters control the match: the standard quantifiers (star and friends) are greedy ( 151), while others may be lazy or possessive ( 169). Alternation is ordered ( 174) in a traditional NFA, but greedy with a POSIX NFA.

POSIX NFA   Must find the longest match, period. But, it’s not boring, as you must worry about
efficiency (the subject of Chapter 6).

Traditional NFA   Is the most expressive type of regex engine, since you can use the regex-
directed nature of the engine to craft exactly the match you want.

Understanding the concepts and practices covered in this chapter is the foundation for writing correct and efficient regular expressions, which just happens to be the subject of the next two chapters.


† California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in “California” and “non-California” models.

† Actually, as we saw in the previous chapter (?128), a POSIX collating sequence can match multiple characters, but this is not common. Also, certain Unicode characters can match multiple characters when applied in a case-insensitive manner (?110), although most implementations do not support this.

† This does not mean that there can’t be some mixing of technologies to try to get the best of both worlds. This is discussed in a sidebar on page 182.

† This example uses capturing as a forum for presenting greediness, so the example itself is appropriate only for NFAs (because only NFAs support capturing). The lessons on greediness, however, apply to all engines, including the non-capturing DFA.

† I suppose I could explain the underlying theory that goes into these names, if I only knew it! As I hinted, the word deterministic is pretty important, but for the most part the theory is not relevant, so long as we understand the practical effects. By the end of this chapter, we will.

† Just for comparison, remember that a DFA doesn’t care much about the form you use to express which matches are possible; the three examples are identical to a DFA.

† With a tool or mode where a dot can match a newline, .+" applied to strings that contain multiline data matches through all the logical lines to the end of the whole string.

† A smart implementation may be able to make the possessive version a bit more efficient than its atomic-grouping counterpart (?250).

lex has trailing context, which is exactly the same thing as zero-width positive lookahead at the end of the regex, but it can’t be generalized and put to use for embedded lookahead.


DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware.

blog comments powered by Disqus
JAVA ARTICLES

- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...
- Generics and Limitations in Java
- Getting Started with Java Web Development in...

Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 



© 2003-2012 by Developer Shed. All rights reserved. DS Cluster 5 - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials