Home arrow Java arrow Greediness, Backtracking, and Expression Processing
JAVA

Greediness, Backtracking, and Expression Processing


This article, the third of a four-part series, discusses how a regular expression engine works. It is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

Author Info:
By: O'Reilly Media
Rating: 4 stars4 stars4 stars4 stars4 stars / 4
January 11, 2007
TABLE OF CONTENTS:
  1. · Greediness, Backtracking, and Expression Processing
  2. · Using Lazy Quantifiers
  3. · The Essence of Greediness, Laziness, and Backtracking
  4. · The essence of atomic grouping

print this article
SEARCH DEVARTICLES

Greediness, Backtracking, and Expression Processing
(Page 1 of 4 )

More About Greediness and Backtracking

Many concerns (and benefits) of greediness are
shared by both an NFA and a DFA. (A DFA doesn’t support laziness, which is why we’ve concentrated on greediness up to this point.) I’d like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there’s not much


Quiz Answer

Answer to the question on page 162.

When matching [0-9] * against  ‘a 1234 num', would 'a 1234 num' be part of a saved state?

The answer is “no.” I posed this question because the mistake is commonly made. Remember, a component that has star applied can always match. If that’s the entire regex, it can always match anywhere. This certainly includes the attempt when the transmission applies the engine the first time, at the start of the string. In this case, the regex matches at ‘a 1234 num’ and that’s the end of it—it never even gets as far the digits.

In case you missed this, there’s still a chance for partial credit. Had there been something in the regex after the [0-9] * that kept an overall match from happening before the engine got to:

then indeed, the attempt of the ‘1’ also creates the state:

 


more to say after that. It’s very easy to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed
nature provides. Besides lazy quantifiers, there are a variety of extra features an NFA can support, including lookaround, conditionals, backreferences, and atomic grouping. And on top of these, an NFA affords the regex author direct control over how a match is carried out, which can be a benefit when used properly, but it does create some efficiency-related pitfalls (discussed in Chapter 6.)

 

Despite these differences, the match results are often similar. For the next few pages, I’ll talk of both engine types, but describe effects in terms of the regex-
directed NFA. By the end of this chapter, you’ll have a firm grasp of just when the results might differ, as well as exactly why.

Problems of Greediness

As we saw with the last example, .* always marches to the end of the line.This is because .* just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.

Sometimes this can be a real pain. Consider a regex to match text wrapped in double quotes. At first, you might want to write ".*", but knowing what we know about .*, guess where it matches in:

  The name "McDonald’s" is said "makudonarudo" in Japanese

Actually, since we understand the mechanics of matching, we don’t need to guess, because we know. Once the initial quote matches, .* is free to match, and immediately does so all the way to the end of the string. It backs off (or, perhaps more appropriately, is backed off by the regex engine) only as much as is needed until the final quote can match. In the end, it matches

  The name "McDonald’s" is said "makudonarudo" in Japanese

which is obviously not the double-quoted string that was intended. This is one reason why I caution against the overuse of .*, as it can often lead to surprising results if you don’t pay careful attention to greediness.

So, how can we have it match "McDonald’s" only? The key is to realize that we don’t want “anything” between the quotes, but rather “anything except a quote.” If we use [ˆ"] * rather than .*, it won’t overshoot the closing quote.

The regex engine’s basic approach with " [ˆ"] *" is exactly the same as before. Once the initial double quote matches, [ˆ"] * gets a shot at matching as much as it can. In this case, that’s up to the double quote after McDonald’s, at which point it finally stops because [ˆ"] can’t match the quote. At that point, control moves to the closing ". It happily matches, resulting in overall success:

  The name "McDonald’s" is said "makudonarudo" in Japanese

Actually, there could be one unexpected change, and that’s because in most flavors, [ˆ"] can match a newline, while dot doesn’t. If you want to keep the regex from crossing lines, use [ˆ"\n] .

Multi-Character “Quotes”

In the first chapter, I talked a bit about matching HTML tags, such as the sequence <B>very</B> that renders the “very” in bold if the browser can do so. Attempting to match a <B>...</B> sequence seems similar to matching a quoted string, except the “quotes” in this case are the multi-character sequences <B> and </B>. Like the quoted string example, multiple sets of “quotes” cause problems if we use .*:

  ˙˙˙<B>Billions</B> and <B>Zillions</B> of suns...

With <B> .* </B>, the greedy .* causes the match in progress to zip to the end of the line, backtracking only far enough to allow the </B> to match, matching the last </B> on the line instead of the one corresponding to the opening <B> at the start of the match.

Unfortunately, since the closing delimiter is more than one character, we can’t solve the problem with a negated class as we did with double-quoted strings. We can’t expect something like <B>[ˆ</B>]*</B> to work. A character class represents only one character and not the full </B> sequence that we want. Don’t let the apparent structure of [ˆ</B>] fool you. It is just a class to match one character—any one except <, >/, and B. It is the same as, say [^/<>B], and certainly  doesn’t work as an “anything not </B>"  construct. (With lookahead, you can insist that </B> not match at a particular point; we’ll see this in action in the next section.)


blog comments powered by Disqus
JAVA ARTICLES

- Java Too Insecure, Says Microsoft Researcher
- Google Beats Oracle in Java Ruling
- Deploying Multiple Java Applets as One
- Deploying Java Applets
- Understanding Deployment Frameworks
- Database Programming in Java Using JDBC
- Extension Interfaces and SAX
- Entities, Handlers and SAX
- Advanced SAX
- Conversions and Java Print Streams
- Formatters and Java Print Streams
- Java Print Streams
- Wildcards, Arrays, and Generics in Java
- Wildcards and Generic Methods in Java
- Finishing the Project: Java Web Development ...

Watch our Tech Videos 
Dev Articles Forums 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
Contact Us 
Site Map 
Privacy Policy 
Support 

Developer Shed Affiliates

 




© 2003-2017 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap
Popular Web Development Topics
All Web Development Tutorials