NFA, DFA, POSIX, and the Mechanics of Expression Processing (Page 1 of 4 )
Mimicking atomic grouping with positive lookahead
It’s perhaps mostly academic for flavors that support atomic grouping, but can be quite useful for those that don’t: if you have positive lookahead, and if it supports capturing parentheses within the lookahead (most flavors do, but Tcl’s lookahead, for example, does not), you can mimic atomic grouping and possessive quantifiers. (?>regex) can be mimicked with (?=(regex))\1. For example, compare ˆ(?>\w+): with ˆ(?=(\w+))\1:.
The lookahead version has \w+ greedily match as much as it can, capturing an entire word. Because it’s within lookahead, the intermediate states are thrown away when it’s finished (just as if, incidentally, it had been within atomic grouping). Unlike atomic grouping, the matched word is not included as part of the match (that’s the whole point of lookahead), but the word does remain captured. That’s a key point because it means that when \1 is applied, it’s actually being applied to the very text that filled it, and it’s certain to succeed. This extra step of applying \1 is simply to move the regex past the matched word.
This technique is a bit less efficient than real atomic grouping because of the extra time required to rematch the text via \1. But, since states are thrown away, it fails more quickly than a raw \w+: when the : can’t match.
Is Alternation Greedy?
How alternation works is an important point because it can work in fundamentally different ways with different regex engines. When alternation is reached, any number of the alternatives might be able to match at that point, but which will? Put another way, if more than one can match, which will? If it’s always the one that matches the most text, one might say that alternation is greedy. If it’s always the shortest amount of text, one might say it’s lazy? Which (if either) is it?
Let’s look at the Traditional NFA engine used in Perl, PHP, Java, .NET languages, and many others ( 145). When faced with alternation, each alternative is checked in the left-to-right order given in the expression. With the example regex of ˆ(Subject|Date) : , when the Subject|Date alternation is reached, the first alternative, Subject, is attempted. If it matches, the rest of the regex (the subsequent:) is given a chance. If it turns out that it can’t match, and if other alternatives remain (in this case, Date), the regex engine backtracks to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, all alternatives) are exhausted.
More About Greediness and Backtracking
So, with that common Traditional NFA engine, what text is actually matched by tour|to|tournament when applied to the string ‘three tournaments won’? All the alternatives are attempted (and fail) during attempts starting at each character position until the transmission starts the attempt at ‘three tournaments won’. This time, the first alternative, tour, matches. Since the alternation is the last thing in the regex, the moment the tour matches, the whole regex is done. The other alternatives are not even tried again.
So, we see that alternation is neither greedy nor lazy, but ordered, at least for a Traditional NFA. This is more powerful than greedy alternation because it allows more control over just how a match is attempted—it allows the regex author to express “try this, then that, and finally try that, until you get a match.”
Not all flavors have ordered alternation. DFAs and POSIX NFAs do have greedy alternation, always matching with the alternative that matches the most text (tournament in this case). But, if you’re using Perl, PHP, a .NET language, java.util.regex, or any other system with a Traditional NFA engine (list 145), your alternation is ordered.
Taking Advantage of Ordered Alternation
Let’s revisit the (\.\d\d[1-9]?)\d* example from page 167. If we realize that \.\d\d[1-9]?, in effect, says “allow either \.\d\d or \.\d\d[1-9] ", we can rewrite the entire expression as (\.\d\d|\.\d\d[1-9])\d*". (There is no compelling reason to make this change—it’s merely a handy example.) Is this really the same as the original? If alternation is truly greedy, then it is, but the two are quite different with ordered alternation.
Let’s consider it as ordered for the moment. The first alternative is selected and tested, and if it matches, control passes to the \d* that follows the alternation. If there are digits remaining, the \d* matches them, including any initial non-zero digit that was the root of the original example’s problem (if you’ll recall the original problem, that’s a digit we want to match only within the parentheses, not by the \d* after the parentheses). Also, realize that if the first alternative can’t match, the second alternative will certainly not be able to, as it begins with a copy of the entire first alternative. If the first alternative doesn’t match, though, the regex engine nevertheless expends the effort for the futile attempt of the second.
Interestingly, if we swap the alternatives and use (\.\d\d[1-9]| \.\d\d)\d*, we do effectively get a replica of the original greedy (\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the trailing [1-9], the second alternative still stands a chance. It’s still ordered alternation, but now we’ve selected the order to result in a greedy-type match.
When first distributing the [1-9]? to two alternatives, in placing the shorter one first, we fashioned a non-greedy ? of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake when used with a Traditional NFA. In one book I’ve read, a*((ab)*|b*) is used as an example in explaining something about Traditional NFA regex parentheses. It’s a pointless example because the first alternative, (ab)*, can never fail, so any other alternatives are utterly meaningless. You could add
a*((ab)*|b*|.*|partridge in a pear tree;[a-z])
and it wouldn’t change the meaning a bit. The moral is that with ordered alternation, when more than one alternative can potentially match the same text, care must be taken when selecting the order of the alternatives.
Ordered alternation pitfalls
Ordered alternation can be put to your advantage by allowing you to craft just the match you want, but it can also lead to unexpected pitfalls for the unaware. Consider matching a January date of the form ‘Jan 31’. We need something more sophisticated than, say, Jan.[0-9], as that allows "dates" as 'Jan.00', 'Jan.39', and disallows, 'Jan.7'.
One way to match the date part is to attack it in sections. To match from the first through the ninth, using 0?[1-9] allows a leading zero. Adding [0-9] allows for the tenth through the 29th, and 3 rounds it out. Putting it all together, we get Jan.(0?[1-9]|[0-9]|3).
Where do you think this matches in ‘Jan 31 is Dad’s birthday’? We want it to match ‘Jan 31’, of course, but ordered alternation actually matches only ‘Jan 3’. Surprised? During the match of the first alternative, 0?[1-9], the leading 0? fails, but the alternative matches because the subsequent [1-9] has no trouble matching the 3. Since that’s the end of the expression, the match is complete.
When the order of the alternatives is adjusted so that the alternative that can potentially match a shorter amount of text is placed last, the problem goes away. This works: Jan.([0-9]|3|0?[1-9]).
Another approach is Jan.(31|0|?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet, a third approach is Jan.(0[1-9]|[0-9]?|3?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I’ll leave for your free time, although the sidebar on the facing page should be helpful).
NFA, DFA, and POSIX
A Few Ways to Slice and Dice a Date
A few approaches to the date-matching problem posed on page 176. The calendar associated with each regex shows what can be matched by each alternative color-coded within the regex.