NFA, DFA, POSIX, and the Mechanics of Expression Processing - DFA versus NFA: Differences in ease of implementation
(Page 4 of 4 )
Although they have limitations, simple versions of DFA and NFA engines are easy enough to understand and to implement. The desire for efficiency (both in time and memory) and enhanced features drives the implementation to greater and greater complexity.
With code length as a metric, consider that the NFA regex support in the Version 7 ( January 1979) edition of ed was less than 350 lines of C code. (For that matter, the entire source for grep was a scant 478 lines.) Henry Spencer’s 1986 freely available implementation of the Version 8 regex routines was almost 1,900 lines of C, and Tom Lord’s 1992 POSIX NFA package rx (used in GNU sed, among other tools) is a stunning 9,700 lines.
For DFA implementations, the Version 7 egrep regex engine was a bit over 400 lines long, while Henry Spencer’s 1992 full-featured POSIX DFA package is over 4,500 lines long.
To provide the best of both worlds, GNU egrep Version 2.4.2 utilizes two fully functional engines (about 8,900 lines of code), and Tcl’s hybrid DFA/NFA engine (see the sidebar on the facing page) is about 9,500 lines of code.
Some implementations are simple, but that doesn’t necessarily mean they are short on features. I once wanted to use regular expressions for some text processing in Pascal. I hadn’t used Pascal since college, but it still didn’t take long to write a simple NFA regex engine. It didn’t have a lot of bells and whistles, and wasn’t built for maximum speed, but the flavor was relatively full-featured and was quite useful.
Summary
If you understood everything in this chapter the first time you read it, you probably didn’t need to read it in the first place. It’s heady stuff, to say the least. It took me quite a while to understand it, and then longer still to understand it. I hope this one concise presentation makes it easier for you. I’ve tried to keep the explanation simple without falling into the trap of oversimplication (an unfortunately all-too-common occurrence which hinders true understanding). This chapter has a lot in it, so I’ve included a lot of page references in the following summary, for when you’d like to quickly check back on something.
There are two underlying technologies commonly used to implement a regex match engine, “regex-directed NFA”( 153) and “text-directed DFA”( 155). The
abbreviations are spelled out on page 156.
Combine the two technologies with the POSIX standard ( 178), and for practical purposes, there are three types of engines:
- Traditional NFA (gas-guzzling, power-on-
demand)
- POSIX NFA (gas-guzzling, standard-
compliant)
- DFA (POSIX or not) (electric, steady-as-she-
goes)
To get the most out of a utility, you need to understand which type of engine it uses, and craft your regular expressions appropriately. The most common type is the Traditional NFA, followed by the DFA. Table 4-1 ( 145) lists a few common tools and their engine types, and the section “Testing the Engine Type” ( 146) shows how you can test the type yourself.
One overriding rule regardless of engine type: matches starting sooner take precedence over matches starting later. This is due to how the engine’s “transmission” tests the regex at each point in the string ( 148).
For the match attempt starting at any given spot:
DFA Text-Directed Engines
Find the longest possible match, period. That’s it. End of discussion ( 177). Consistent, very fast
( 179), and boring to talk about.
NFA Regex-Directed Engines
Must “work through” a match. The soul of NFA matching is backtracking ( 157, 162). The metacharacters control the match: the standard quantifiers (star and friends) are greedy ( 151), while others may be lazy or possessive ( 169). Alternation is ordered ( 174) in a traditional NFA, but greedy with a POSIX NFA.
POSIX NFA Must find the longest match, period. But, it’s not boring, as you must worry about
efficiency (the subject of Chapter 6).
Traditional NFA Is the most expressive type of regex engine, since you can use the regex-
directed nature of the engine to craft exactly the match you want.
Understanding the concepts and practices covered in this chapter is the foundation for writing correct and efficient regular expressions, which just happens to be the subject of the next two chapters.
† California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in “California” and “non-California” models.
† Actually, as we saw in the previous chapter (?128), a POSIX collating sequence can match multiple characters, but this is not common. Also, certain Unicode characters can match multiple characters when applied in a case-insensitive manner (?110), although most implementations do not support this.
† This does not mean that there can’t be some mixing of technologies to try to get the best of both worlds. This is discussed in a sidebar on page 182.
† This example uses capturing as a forum for presenting greediness, so the example itself is appropriate only for NFAs (because only NFAs support capturing). The lessons on greediness, however, apply to all engines, including the non-capturing DFA.
† I suppose I could explain the underlying theory that goes into these names, if I only knew it! As I hinted, the word deterministic is pretty important, but for the most part the theory is not relevant, so long as we understand the practical effects. By the end of this chapter, we will.
† Just for comparison, remember that a DFA doesn’t care much about the form you use to express which matches are possible; the three examples are identical to a DFA.
† With a tool or mode where a dot can match a newline, .+" applied to strings that contain multiline data matches through all the logical lines to the end of the whole string.
† A smart implementation may be able to make the possessive version a bit more efficient than its atomic-grouping counterpart (?250).
† lex has trailing context, which is exactly the same thing as zero-width positive lookahead at the end of the regex, but it can’t be generalized and put to use for embedded lookahead.
| DISCLAIMER: The content provided in this article is not warranted or guaranteed by Developer Shed, Inc. The content provided is intended for entertainment and/or educational purposes in order to introduce to the reader key ideas, concepts, and/or product reviews. As such it is incumbent upon the reader to employ real-world tactics for security and implementation of best practices. We are not liable for any negative consequences that may result from implementing any information covered in our articles or tutorials. If this is a hardware review, it is not recommended to open and/or modify your hardware. |
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|