Greediness, Backtracking, and Expression Processing
(Page 1 of 4 )
This article, the third of a four-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book
Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.
More About Greediness and Backtracking
Many concerns (and benefits) of greediness are
shared by both an NFA and a DFA. (A DFA doesn’t support laziness, which is why we’ve concentrated on greediness up to this point.) I’d like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there’s not much
Quiz Answer
Answer to the question on page 162.
When matching [0-9] * against ‘a 1234 num', would 'a 1234 num' be part of a saved state?
The answer is “no.” I posed this question because the mistake is commonly made. Remember, a component that has star applied can always match. If that’s the entire regex, it can always match anywhere. This certainly includes the attempt when the transmission applies the engine the first time, at the start of the string. In this case, the regex matches at ‘a 1234 num’ and that’s the end of it—it never even gets as far the digits.
In case you missed this, there’s still a chance for partial credit. Had there been something in the regex after the [0-9] * that kept an overall match from happening before the engine got to:

then indeed, the attempt of the ‘1’ also creates the state:

more to say after that. It’s very easy to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed
nature provides. Besides lazy quantifiers, there are a variety of extra features an NFA can support, including lookaround, conditionals, backreferences, and atomic grouping. And on top of these, an NFA affords the regex author direct control over how a match is carried out, which can be a benefit when used properly, but it does create some efficiency-related pitfalls (discussed in Chapter 6.)
Despite these differences, the match results are often similar. For the next few pages, I’ll talk of both engine types, but describe effects in terms of the regex-
directed NFA. By the end of this chapter, you’ll have a firm grasp of just when the results might differ, as well as exactly why.
Problems of Greediness
As we saw with the last example, .* always marches to the end of the line.† This is because .* just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.
Sometimes this can be a real pain. Consider a regex to match text wrapped in double quotes. At first, you might want to write ".*", but knowing what we know about .*, guess where it matches in:
The name "McDonald’s" is said "makudonarudo" in Japanese
Actually, since we understand the mechanics of matching, we don’t need to guess, because we know. Once the initial quote matches, .* is free to match, and immediately does so all the way to the end of the string. It backs off (or, perhaps more appropriately, is backed off by the regex engine) only as much as is needed until the final quote can match. In the end, it matches
The name "McDonald’s" is said "makudonarudo" in Japanese
which is obviously not the double-quoted string that was intended. This is one reason why I caution against the overuse of .*, as it can often lead to surprising results if you don’t pay careful attention to greediness.
So, how can we have it match "McDonald’s" only? The key is to realize that we don’t want “anything” between the quotes, but rather “anything except a quote.” If we use [ˆ"] * rather than .*, it won’t overshoot the closing quote.
The regex engine’s basic approach with " [ˆ"] *" is exactly the same as before. Once the initial double quote matches, [ˆ"] * gets a shot at matching as much as it can. In this case, that’s up to the double quote after McDonald’s, at which point it finally stops because [ˆ"] can’t match the quote. At that point, control moves to the closing ". It happily matches, resulting in overall success:
The name "McDonald’s" is said "makudonarudo" in Japanese
Actually, there could be one unexpected change, and that’s because in most flavors, [ˆ"] can match a newline, while dot doesn’t. If you want to keep the regex from crossing lines, use [ˆ"\n] .
Multi-Character “Quotes”
In the first chapter, I talked a bit about matching HTML tags, such as the sequence <B>very</B> that renders the “very” in bold if the browser can do so. Attempting to match a <B>...</B> sequence seems similar to matching a quoted string, except the “quotes” in this case are the multi-character sequences <B> and </B>. Like the quoted string example, multiple sets of “quotes” cause problems if we use .*:
˙˙˙<B>Billions</B> and <B>Zillions</B> of suns...
With <B> .* </B>, the greedy .* causes the match in progress to zip to the end of the line, backtracking only far enough to allow the </B> to match, matching the last </B> on the line instead of the one corresponding to the opening <B> at the start of the match.
Unfortunately, since the closing delimiter is more than one character, we can’t solve the problem with a negated class as we did with double-quoted strings. We can’t expect something like <B>[ˆ</B>]*</B> to work. A character class represents only one character and not the full </B> sequence that we want. Don’t let the apparent structure of [ˆ</B>] fool you. It is just a class to match one character—any one except <, >, /, and B. It is the same as, say [^/<>B], and certainly doesn’t work as an “anything not </B>" construct. (With lookahead, you can insist that </B> not match at a particular point; we’ll see this in action in the next section.)
Next: Using Lazy Quantifiers >>
More Java Articles
More By O'Reilly Media
|
This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.
|
|