NFA, DFA, POSIX, and the Mechanics of Expression Processing - NFA, DFA, and POSIX
(Page 2 of 4 )
"The Longest-Leftmost"
Let me repeat what I’ve said before: when the transmission starts a DFA engine from some particular point in the string, and there exists a match or matches to be found at that position, the DFA finds the longest possible match, period. Since it’s the longest from among all possible matches that start equally furthest to the left, it’s the “longest-leftmost” match.
Really, the longest
Issues of which match is longest aren’t confined to alternation. Consider how an NFA matches the (horribly contrived) one(self)?(selfsufficient)? against the string oneselfsufficient. An NFA first matches one and then the greedy (self)?, leaving (selfsufficient)? left to try against sufficient. It doesn’t match, but that’s OK since it is optional. So, the Traditional NFA returns oneselfsufficient and discards the untried states. (A POSIX NFA is another story that we’ll get to shortly.)
On the other hand, a DFA finds the longer oneselfsufficient. An NFA would also find that match if the initial (self)? were to somehow go unmatched, as that would leave (selfsufficient)? then able to match. A Traditional NFA doesn’t do that, but the DFA finds it nevertheless, since it’s the longest possible match available to the current attempt. It can do this because it keeps track of all matches simultaneously, and knows at all times about all possible matches.
I chose this silly example because it’s easy to talk about, but I want you to realize that this issue is important in real life. For example, consider trying to match continuation lines. It’s not uncommon for a data specification to allow one logical line to extend across multiple real lines if the real lines end with a backslash before the newline. As an example, consider the following:
SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \
missing.c msg.c node.c re.c version.c
You might normally want to use ˆ\w+=.* to match this kind of “var = value” assignment line, but this regex doesn’t consider the continuation lines. (I’m assuming for the example that the tool’s dot won’t match a newline.) To match continuation lines, you might consider appending (\\\n.*)* to the regex, yielding ˆ\w+=.*(\\\n.*)*. Ostensibly, this says that any number of additional logical lines are allowed so long as they each follow an escaped newline. This seems reasonable, but it will never work with a traditional NFA. By the time the original .* has reached the newline, it has already passed the backslash, and nothing in what was added forces it to backtrack
( 152). Yet, a DFA finds the longer multiline match if available, simply because it is, indeed, the longest.
If you have lazy quantifiers available, you might consider using them, such as with ˆ\w+=.*?(\\\n.*?)*. This allows the escaped newline part to be tested each time before the first dot actually matches anything, so the thought is that the \\ then gets to match the backslash before the newline. Again, this won’t work. A lazy quantifier actually ends up matching something optional only when forced to do so, but in this case, everything after the = is optional, so there’s nothing to force the lazy quantifiers to match anything. Our lazy example matches only ‘SRC=', so it’s certainly not the answer.
There are other approaches to solving this problem; we’ll continue with this example in the next chapter ( 186).
POSIX and the Longest-Leftmost Rule
The POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text must be the one returned.
The POSIX standard document uses the phrase “longest of the leftmost.” It doesn’t say you have to use a DFA, so if you want to use an NFA when creating a POSIX tool, what’s a programmer to do? If you want to implement a POSIX NFA, you’d have to find the full oneselfsufficient and all the continuation lines, despite these results being “unnatural” to your NFA.
A Traditional NFA engine stops with the first match it finds, but what if it were to continue to try options (states) that might remain? Each time it reached the end of the regex, it would have another plausible match. By the time all options are exhausted, it could simply report the longest of the plausible matches it had found. Thus, a POSIX NFA.
An NFA applied to the first example would, in matching (self)?, have saved an option noting that it could pick up matching one(self)?(selfsufficient)? at
oneselfsufficient. Even after finding the oneselfsufficient that a Traditional NFA stops at, a POSIX NFA continues to exhaustively check the remaining options, eventually realizing that yes, there is a way to match the longer (and in fact, longest) oneselfsufficient.
In Chapter 7, we’ll see a method to trick Perl into mimicking POSIX semantics, having it report the longest match ( 335).
Next: Speed and Efficiency >>
More Java Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|