Home arrow Java arrow Page 2 - NFA, DFA, POSIX, and the Mechanics of Expression Processing
JAVA

NFA, DFA, POSIX, and the Mechanics of Expression Processing


This article, the last of a multi-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 5 stars5 stars5 stars5 stars5 stars / 1
January 18, 2007
TABLE OF CONTENTS:
  1. · NFA, DFA, POSIX, and the Mechanics of Expression Processing
  2. · NFA, DFA, and POSIX
  3. · Speed and Efficiency
  4. · DFA versus NFA: Differences in ease of implementation

print this article
SEARCH DEVARTICLES

TOOLS YOU CAN USE

advertisement
NFA, DFA, POSIX, and the Mechanics of Expression Processing - NFA, DFA, and POSIX
(Page 2 of 4 )

"The Longest-Leftmost"

Let me repeat what I’ve said before: when the transmission starts a DFA engine from some particular point in the string, and there exists a match or matches to be found at that position, the DFA finds the longest possible match, period. Since it’s the longest from among all possible matches that start equally furthest to the left, it’s the “longest-leftmost” match.

Really, the longest

Issues of which match is longest aren’t confined to alternation. Consider how an NFA matches the (horribly contrived) one(self)?(selfsufficient)?  against the string oneselfsufficient. An NFA first matches  one and then the greedy (self)?,  leaving (selfsufficient)? left to try against sufficient. It doesn’t match, but that’s OK since it is optional. So, the Traditional NFA returns oneselfsufficient and discards the untried states. (A POSIX NFA is another story that we’ll get to shortly.)

On the other hand, a DFA finds the longer oneselfsufficient. An NFA would also find that match if the initial (self)? were to somehow go unmatched, as that would leave (selfsufficient)? then able to match. A Traditional NFA doesn’t do that, but the DFA finds it nevertheless, since it’s the longest possible match available to the current attempt. It can do this because it keeps track of all matches simultaneously, and knows at all times about all possible matches.

I chose this silly example because it’s easy to talk about, but I want you to realize that this issue is important in real life. For example, consider trying to match continuation lines. It’s not uncommon for a data specification to allow one logical line to extend across multiple real lines if the real lines end with a backslash before the newline. As an example, consider the following:

  SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \
          missing.c msg.c node.c re.c version.c

You might normally want to use ˆ\w+=.* to match this kind of “var = value” assignment line, but this regex doesn’t consider the continuation lines. (I’m assuming for the example that the tool’s dot won’t match a newline.) To match continuation lines, you might consider appending (\\\n.*)* to the regex, yielding ˆ\w+=.*(\\\n.*)*. Ostensibly, this says that any number of additional logical lines are allowed so long as they each follow an escaped newline. This seems reasonable, but it will never work with a traditional NFA. By the time the original .* has reached the newline, it has already passed the backslash, and nothing in what was added forces it to backtrack
( 152). Yet, a DFA finds the longer multiline match if available, simply because it is, indeed, the longest.

If you have lazy quantifiers available, you might consider using them, such as with ˆ\w+=.*?(\\\n.*?)*.  This allows the escaped newline part to be tested each time before the first dot actually matches anything, so the thought is that the \\ then gets to match the backslash before the newline. Again, this won’t work. A lazy quantifier actually ends up matching something optional only when forced to do so, but in this case, everything after the is optional, so there’s nothing to force the lazy quantifiers to match anything. Our lazy example matches only ‘SRC=',  so it’s certainly not the answer.

There are other approaches to solving this problem; we’ll continue with this example in the next chapter ( 186).

POSIX and the Longest-Leftmost Rule

The POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text must be the one returned.

The POSIX standard document uses the phrase “longest of the leftmost.” It doesn’t say you have to use a DFA, so if you want to use an NFA when creating a POSIX tool, what’s a programmer to do? If you want to implement a POSIX NFA, you’d have to find the full oneselfsufficient and all the continuation lines, despite these results being “unnatural” to your NFA.

A Traditional NFA engine stops with the first match it finds, but what if it were to continue to try options (states) that might remain? Each time it reached the end of the regex, it would have another plausible match. By the time all options are exhausted, it could simply report the longest of the plausible matches it had found. Thus, a POSIX NFA.

An NFA applied to the first example would, in matching (self)?, have saved an option noting that it could pick up matching one(self)?(selfsufficient)? at
oneselfsufficient. Even after finding the oneselfsufficient that a Traditional NFA stops at, a POSIX NFA continues to exhaustively check the remaining options, eventually realizing that yes, there is a way to match the longer (and in fact, longest) oneselfsufficient.

In Chapter 7, we’ll see a method to trick Perl into mimicking POSIX semantics, having it report the longest match ( 335).


blog comments powered by Disqus
JAVA ARTICLES

- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...
- Generics and Limitations in Java
- Getting Started with Java Web Development in...

Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 



© 2003-2012 by Developer Shed. All rights reserved. DS Cluster 8 - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials