Home arrow Java arrow Two Engines for Expression Processing
JAVA

Two Engines for Expression Processing


This article, the second of a four-part series, continues our discussion of how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 5 stars5 stars5 stars5 stars5 stars / 3
January 04, 2007
TABLE OF CONTENTS:
  1. · Two Engines for Expression Processing
  2. · Backtracking
  3. · A match without backtracking
  4. · Backtracking and Greediness

print this article
SEARCH DEVARTICLES

Two Engines for Expression Processing
(Page 1 of 4 )

Regex-Directed Versus Text-Directed

The two basic engine types reflect a fundamental
difference in algorithms available for applying a regular expression to a string. I call the gasoline-driven NFA engine “regex-directed,” and the electric-driven DFA “text-directed.”

NFA Engine: Regex-Directed

Let’s consider one way an engine might match to(nite|knight|night) against the text ...tonight... '. Starting with the t, the regular expression is examined one component at a time, and the “current text” is checked to see whether it is matched by the current component of the regex. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.


Quiz Answer

Answer to the question on page 153.

When ^.* ([0-9] + ) is applied to 'Copyright 2003.’, what is captured by the parentheses?

The desire is to get the last whole number, but it doesn’t work. As before, .* is forced to relinquish some of what it had matched because the subsequent [0-9] + requires a match to be successful. In this example, that means unmatching the final period and ‘3’, which then allows [0-9] to match. That’s governed by +, so matching just once fulfills its minimum, and now facing ‘.' in the string, it finds nothing else to match.

Unlike before, though, there’s then nothing further that must match, so .* is not forced to give up the 0 or any other digits it might have matched. Were .* to do so, the [0-9] + would certainly be a grateful and greedy recipient, but nope, first come first served. Greedy constructs give up something they’ve matched only when forced. In the end, $1 gets only '3'.

If this feels counter-intuitive, realize that [0-9] + is at most one match away from [0-9] *, which is in the same league as .*. Substituting that into ˆ.* ([0-9] +), we get ^.* (.*) as our regex, which looks suspiciously like the ^subject: (.*) .* example from page 152, where the second .* was guaranteed to match nothing.


With the to (nite|knight|night) example, the first component is t, which repeatedly fails until a ‘t'is reached in the target string. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the “next component” is (nite|knight|night) which really means “nite or knight or night."  Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we’re matching tonight, the third alternative is the one that leads to a match. Despite their brainy origins ( 85), a regex-directed engine can’t come to that conclusion until actually going through the motions to check.

Attempting the first alternative, nite,  involves the same component-at-a-time treatment as before: “ Try to match n, then i, then t, and finally e."  If this fails, as it eventually does, the engine tries another alternative, and so on until it achieves a match or must report failure. Control moves within the regex from component to component, so I call it “regex-directed.”

The control benefits of an NFA engine

In essence, each subexpression of a regex in a regex-directed match is checked independently of the others. Other than backreferences, there’s no interrelation among subexpressions, except for the relation implied by virtue of being thrown together to make a larger expression. The layout of the subexpressions and regex control structur es (e.g., alternation, parentheses, and quantifiers) controls an engine’s overall movement through a match.

Since the regex directs the NFA engine, the driver (the writer of the regular expression) has considerable opportunity to craft just what he or she wants to happen. (Chapters 5 and 6 show how to put this to use to get a job done correctly and efficiently.) What this really means may seem vague now, but it will all be spelled out soon.

DFA Engine: Text-Directed

Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches “currently in the works.” In the tonight example, the moment the engine hits t, it adds a potential match to its list of those currently in
progress:

in string

in regex

after ˙˙˙tonight˙˙˙

possible matches: to(nite|knight|night) 

Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes

in string

in regex

after ˙˙˙tonight˙˙˙

possible matches: to(nite|knight|night) 

with two possible matches in the works (and one alternative, knight, ruled out). With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.

I call this “text-directed” matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a “partial match in progress” is also a full match. If the regex were to (...) ?, for example, the parenthesized expression becomes optional, but it’s still greedy, so it’s always attempted. All the time that a partial match is in progress inside those
parentheses, a full match (of  ‘to’) is already confirmed and in reserve in case the longer matches don’t pan out.

If the engine reaches a character in the text that invalidates all the matches in the works, it must revert to one of the full matches in reserve. If there are none, it must declare that there are no matches at the current attempt’s starting point.

First Thoughts: NFA and DFA in Comparison

If you compare these two engines based only on what I’ve mentioned so far, you might conclude that the text-directed DFA engine is generally faster. The regex-directed NFA engine might waste time attempting to match different subexpressions against the same text (such as the three alternatives in the example).

You would be right. During the course of an NFA match, the same character of the target might be checked by many different parts of the regex (or even by the same part, over and over). Even if a subexpression can match, it might have to be applied again (and again and again) as it works in concert with the rest of the regex to find a match. A local subexpression can fail or match, but you just never know about the overall match until you eventually work your way to the end of the regex. (If I could find a way to include “It’s not over until the fat lady sings.” in this paragraph, I would.) On the other hand, a DFA engine is deterministic—each character in the target is checked once (at most). When a character matches, you don’t know yet if it will be part of the final match (it could be part of a possible match that doesn’t pan out), but since the engine keeps track of all possible matches in parallel, it needs to be checked only once, period.

The two basic technologies behind regular-expression engines have the somewhat imposing names
Nondeterministic Finite Automaton (NFA) and
Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just “NFA” and “DFA.” We won’t be seeing these phrases spelled out again.

Consequences to us as users

Because of the regex-directed nature of an NFA, the details of how the engine attempts a match are very important. As I said before, the writer can exercise a fair amount of control simply by changing how the regex is written. With the tonight example, perhaps less work would have been wasted had the regex been written differently, such as in one of the following ways:

  1. to (ni(ght|te)|knight)
  2. tonite|toknight|tonight
  3. to (k?night|nite)

With any given text, these all end up matching exactly the same thing, but in doing so direct the engine in different ways. At this point, we don’t know enough to judge which of these, if any, are better than the others, but that’s coming soon.

It’s the exact opposite with a DFA—since the engine keeps track of all matches simultaneously, none of these differences in representation matter so long as in the end they all represent the same set of possible matches. There could be a hundred different ways to achieve the same result, but since the DFA keeps track of them all simultaneously (almost magically—more on this later), it doesn’t matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc, and [aa-a] (b|b{1}|b)c are utterly indistinguishable.

Three things come to my mind when describing a DFA engine:

  1. DFA matching is very fast.
  2. DFA matching is very consistent.
  3. Talking about DFA matching is very boring.

I’ll eventually expand on all these points.

The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide plenty of room for creative juices to flow. There are great benefits in crafting an expression well, and even greater penalties for doing it poorly. A gasoline engine is not the only engine that can stall and conk out completely. To get to the bottom of this, we need to look at the essence of an NFA engine: backtracking.


blog comments powered by Disqus
JAVA ARTICLES

- Java Too Insecure, Says Microsoft Researcher
- Google Beats Oracle in Java Ruling
- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials