The Mechanics of Expression Processing - Regex Engine Types
(Page 2 of 4 )
There are two fundamentally different types of regex engines: one called “DFA” (the electric engine of our story) and one called “NFA” (the gas engine). The details of what NFA and DFA mean follow shortly ( 156), but for now just consider them names, like Bill and Ted. Or electric and gas.
Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include the .NET languages, PHP, Ruby, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Some systems have a multi-engine hybrid system, using the most appropriate engine for the job (or even one that swaps between engines for different parts of the same regex, as needed to get the best combination of features and speed). Table 4-1 lists a few common programs and the regex engine that most versions use. If your favorite program is not in the list, the section “Testing the Engine Type” on the next page can help you ?nd out which it is.
Table 4-1: Some Tools and Their Regex Engines
| Engine type | Programs |
| DFA | awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail |
| Traditional NFA | GNU Emacs, Java, grep (most versions), less, more, .NET languages, |
| | PCRE library, Perl, PHP (all three regex suites), Python, Ruby, |
| | sed (most versions), vi |
| POSIX NFA | mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested) |
| Hybrid NFA/DFA | GNU awk, GNU grep/ egrep,Tcl |
As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by clearly specifying not only which metacharacters and features an engine should support, as mentioned in the previous chapter, but also exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to this new standard, but the kind of results an NFA traditionally provided were different, so changes were needed. As a result, broadly speaking, there are three types of regex engines:
- DFA (POSIX or not—similar either way)
- Traditional NFA (most common: Perl, .NET, PHP, Java, Python, ...)
- POSIX NFA
Here, we use “POSIX” to refer to the match semantics —the expected operation of a regex that the POSIX standard specifies (which we’ll get to later in this chapter). Don’t confuse this use of “POSIX” with uses surrounding regex features introduced in that same standard ( 127). Many programs support the features without supporting the full match semantics.
Old (and minimally featured) programs like egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just
confirmed the status quo—no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests (POSIX NFA)
were fine in that they produced results according to the standard, but the necessary changes only increased how fickle they were to proper tuning.
Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be “good enough” now causes knocks and pings. But, so long as you know how to maintain your baby, the engine runs smoothly and cleanly.
From the Department of Redundancy Department At this point, I’ll ask you to go back and review the story about engines. Every sentence there rings with some truth about regular expressions. A second reading should raise some questions. Particularly, what does it mean that an electric DFA regex engine more or less “just runs?” What affects a gas-powered NFA? How can I tune my regular expressions to run as I want on an NFA? What special concerns does an emissions-controlled POSIX NFA have? What’s a “stalled engine” in the regex world?
Testing the Engine Type Because the type of engine used in a tool influences the type of features it can offer, and how those features appear to work, we can often learn the type of engine a tool has merely by checking to see how it handles a few test expressions. (After all, if you can’t tell the difference, the difference doesn’t matter.) At this point in the book, I wouldn’t expect you to understand why the following test results indicate what they do, but I want to offer these tests now so that if your favorite tool is not listed in Table 4-1, you can investigate before continuing with this and the subsequent chapters.
Traditional NFA or not?
The most commonly used engine is a Traditional NFA, and spotting it is easy. First, are lazy quantifiers ( 141) supported? If so, it’s almost certainly a Traditional NFA. As we’ll see, lazy quantifiers are not possible with a DFA, nor would they make any sense with a POSIX NFA. However, to make sure, simply apply the regex nfa|nfa not to the string ‘nfa not’—if only ‘nfa’ matches, it’s a Traditional NFA. If the entire ‘nfa not’ matches, it’s either a POSIX NFA or a DFA.
DFA or POSIX NFA?
Differentiating between a POSIX NFA and a DFA is sometimes just as simple. Capturing parentheses and backreferences are not supported by a DFA, so that can be one hint, but there are systems that are a hybrid mix between the two engine types, and so may end up using a DFA if there are no capturing parentheses.
Here’s a simple test that can tell you a lot. Apply X(.+)+X to a string like
‘= XX======================’, as with this egrep command:
echo =XX================================= | egrep ’X(.+)+X’
If it takes a long time to finish, it’s an NFA (and if not a Traditional NFA as per the test in the previous section, it must be a POSIX NFA). If it finishes quickly, it’s either a DFA or an NFA with some advanced optimization. Does it display a warning message about a stack
overflow or long match aborted? If so, it’s an NFA.
Match Basics Before looking at the differences among these engine types, let’s first look at their similarities. Certain aspects of the drive train are the same (or for all practical purposes appear to be the same), so these examples can cover all engine types.
About the ExamplesThis chapter is primarily concerned with a generic, full-function regex engine, so some tools won’t support exactly everything presented. In my examples, the dipstick might be to the left of the oil filter, while under your hood it might be behind the distributor cap. Your goal is to understand the concepts so that you can drive and maintain your favorite regex package (and ones you find interest in later).
I’ll continue to use Perl’s notation for most of the examples, although I’ll occasionally show others to remind you that the notation is superficial and that the issues under discussion transcend any one tool or
flavor. To cut down on wordiness here, I’ll rely on you to check Chapter 3 ( 114) if I use an unfamiliar construct.
This chapter details the practical effects of how a match is carried out. It would be nice if everything could be distilled down to a few simple rules that could be memorized without needing to understand what is going on. Unfortunately, that’s not the case. In fact, with all this chapter offers, I identify only two all-encompassing rules:
- The match that begins earliest (leftmost) wins.
- The standard quantifiers (* , +, ? , and {m,n} ) are greedy.
We’ll look at these rules, their effects, and much more throughout this chapter. Let’s start by diving into the details of the first rule.
Next: Rule 1: The Match That Begins Earliest Wins >>
More Java Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|