Home arrow Java arrow NFA, DFA, POSIX, and the Mechanics of Expression Processing

NFA, DFA, POSIX, and the Mechanics of Expression Processing

This article, the last of a multi-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 5 stars5 stars5 stars5 stars5 stars / 3
January 18, 2007
  1. · NFA, DFA, POSIX, and the Mechanics of Expression Processing
  2. · NFA, DFA, and POSIX
  3. · Speed and Efficiency
  4. · DFA versus NFA: Differences in ease of implementation

print this article

NFA, DFA, POSIX, and the Mechanics of Expression Processing
(Page 1 of 4 )

Mimicking atomic grouping with positive lookahead

It’s perhaps mostly academic for flavors that support atomic grouping, but can be quite useful for those that don’t: if you have positive lookahead, and if it supports capturing parentheses within the lookahead (most flavors do, but Tcl’s lookahead, for example, does not), you can mimic atomic grouping and possessive
quantifiers. (?>regex) can be mimicked with
regex))\1. For example, compare ˆ(?>\w+): with ˆ(?=(\w+))\1:. 

The lookahead version has \w+ greedily match as much as it can, capturing an entire word. Because it’s within lookahead, the intermediate states are thrown away when it’s finished (just as if, incidentally, it had been within atomic grouping). Unlike atomic grouping, the matched word is not included as part of the match (that’s the whole point of lookahead), but the word does remain captured. That’s a key point because it means that when \1 is applied, it’s actually being applied to the very text that filled it, and it’s certain to succeed. This extra step of applying \1 is simply to move the regex past the matched word.

This technique is a bit less efficient than real atomic grouping because of the extra time required to rematch the text via \1. But,  since states are thrown away, it fails more quickly than a raw \w+: when the : can’t match.

Is Alternation Greedy?

How alternation works is an important point because it can work in fundamentally different ways with
different regex engines. When alternation is reached, any number of the alternatives might be able to match at that point, but which will? Put another way, if more than one can match, which will? If it’s always the one that matches the most text, one might say that alternation is greedy. If it’s always the shortest amount of text, one might say it’s lazy? Which (if either) is it?

Let’s look at the Traditional NFA engine used in Perl, PHP, Java, .NET languages, and many others ( 145). When faced with alternation, each alternative is checked in the left-to-right order given in the expression. With the example regex of ˆ(Subject|Date) : ,  when the Subject|Date 
alternation is reached, the first alternative,  Subject, is attempted. If it matches, the rest of the regex (the subsequent:) is given a chance. If it turns out that it can’t match, and if other alternatives remain (in this case, Date), the regex engine backtracks to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, all alternatives) are exhausted.

More About Greediness and Backtracking

So, with that common Traditional NFA engine, what text is actually matched by tour|to|tournament when applied to the string ‘three tournaments won’? All the alternatives are attempted (and fail) during attempts starting at each character position until the transmission starts the attempt at ‘three tournaments won’. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.

So, we see that alternation is neither greedy nor lazy, but ordered, at least for a Traditional NFA. This is more powerful than greedy alternation because it allows
more control over just how a match is attempted—it allows the regex author to express “try this, then that, and finally try that, until you get a match.”

Not all flavors have ordered alternation. DFAs and POSIX NFAs do have greedy alternation, always matching with the alternative that matches the most text (tournament in this case). But, if you’re using Perl, PHP, a .NET language, java.util.regex, or any other system with a Traditional NFA engine (list  145), your alternation is ordered.

Taking Advantage of Ordered Alternation

Let’s revisit the (\.\d\d[1-9]?)\d* example from page 167. If we realize that \.\d\d[1-9]?, in effect, says “allow either \.\d\d or \.\d\d[1-9] ", we can rewrite the entire expression as (\.\d\d|\.\d\d[1-9])\d*". (There is no compelling reason to make this change—it’s merely a handy example.) Is this really the same as the original? If alternation is truly greedy, then it is, but the two are quite different with ordered alternation.

Let’s consider it as ordered for the moment. The first alternative is selected and tested, and if it matches, control passes to the \d* that follows the alternation. If there are digits remaining, the \d* matches them, including any initial non-zero digit that was the root of the original example’s problem (if you’ll recall the original problem, that’s a digit we want to match only within the parentheses, not by the \d* after the parentheses). Also, realize that if the first alternative can’t match, the second alternative will certainly not be able to, as it begins with a copy of the entire first alternative. If the first alternative doesn’t match, though, the regex engine nevertheless expends the effort for the futile attempt of the second.

Interestingly, if we swap the alternatives and use (\.\d\d[1-9]| \.\d\d)\d*, we do effectively get a replica of the original greedy (\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the trailing [1-9], the second alternative still stands a chance. It’s still ordered alternation, but now we’ve selected the order to result in a greedy-type match.

When first distributing the [1-9]? to two alternatives, in placing the shorter one first, we fashioned a non-greedy ? of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake when used with a Traditional NFA. In one book I’ve read, a*((ab)*|b*) is used as an example in explaining something about Traditional NFA regex parentheses. It’s a pointless example because the first alternative,  (ab)*, can never fail, so any other alternatives are utterly meaningless. You could add

  a*((ab)*|b*|.*|partridge in a pear tree;[a-z])

and it wouldn’t change the meaning a bit. The moral is that with ordered alternation, when more than one alternative can potentially match the same text, care must be taken when selecting the order of the alternatives.

Ordered alternation pitfalls

Ordered alternation can be put to your advantage by allowing you to craft just the match you want, but it can also lead to unexpected pitfalls for the unaware. Consider matching a January date of the form ‘Jan 31’. We need something more sophisticated than, say, Jan.[0123][0-9], as that allows "dates" as 'Jan.00''Jan.39', and disallows, 'Jan.7'.

One way to match the date part is to attack it in sections. To match from the first through the ninth, using 0?[1-9] allows a leading zero. Adding [12][0-9] allows for the tenth through the 29th, and 3[01]  rounds it out. Putting it all together, we get Jan.(0?[1-9]|[12][0-9]|3[01]).

Where do you think this matches in ‘Jan 31 is Dad’s birthday’? We want it to match ‘Jan 31’, of course, but ordered alternation actually matches only ‘Jan 3’. Surprised? During the match of the first alternative, 0?[1-9], the leading 0? fails, but the alternative matches because the subsequent  [1-9] has no trouble matching the 3. Since that’s the end of the expression, the match is complete.

When the order of the alternatives is adjusted so that the alternative that can potentially match a shorter amount of text is placed last, the problem goes away. This works: Jan.([12][0-9]|3[01]|0?[1-9]). 

Another approach is Jan.(31|[123]0|[012]?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet, a third approach is Jan.(0[1-9]|[12][0-9]?|3[01]?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I’ll leave for your free time, although the sidebar on the facing page should be helpful).


A Few Ways to Slice and Dice a Date

A few approaches to the date-matching problem posed on page 176. The calendar associated with each regex shows what can be matched by each alternative color-coded within the regex.

blog comments powered by Disqus

- Java Too Insecure, Says Microsoft Researcher
- Google Beats Oracle in Java Ruling
- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 

Developer Shed Affiliates


© 2003-2019 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials