Java
  Home arrow Java arrow Page 3 - The Mechanics of Expression Processing
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

The Mechanics of Expression Processing
By: O'Reilly Media
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 5 stars5 stars5 stars5 stars5 stars / 1
    2006-12-28

    Table of Contents:
  • The Mechanics of Expression Processing
  • Regex Engine Types
  • Rule 1: The Match That Begins Earliest Wins
  • Rule 2: The Standard Quantifiers Are Greedy

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    The Mechanics of Expression Processing - Rule 1: The Match That Begins Earliest Wins


    (Page 3 of 4 )

    This rule says that any match that begins earlier (leftmost) in the string is always preferred over any plausible match that begins later. This rule doesn’t say anything about how long the winning match might be (we’ll get into that shortly), merely that among all the matches possible anywhere in the string, the one that begins leftmost in the string is chosen. Actually, since more than one plausible match can start at the same earliest point, perhaps the rule should read “a match...” instead of “the match...,” but that sounds odd.

    Here’s how the rule comes about: the match is first attempted at the very beginning of the string to be searched (just before the first character). “Attempted” means that every permutation of the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is re-tried starting from just before the second character. This full retry occurs at each position in the string until a match is found. A “no match” result is reported only if no match is found after the full retry has been attempted at each position all the way to the end of the string (just after the last character).

    Thus, when trying to match ORA against FLORAL, the first attempt at the start of the string fails (since ORA can’t match FLO). The attempt starting at the second character also fails (it doesn’t match LOR either). The attempt starting at the third position, however, does match, so the engine stops and reports the match: FLORAL.

    If you didn’t know this rule, results might sometimes surprise you. For example, when matching cat against

      The dragging belly indicates your cat is too fat

    the match is in indicates, not at the word cat that appears later in the line. This word cat could match, but the cat in indicates appears earlier in the string, so it is the one matched. For an application like egrep, the distinction is irrelevant because it cares only whether there is a match, not where the match might be. For other uses, such as with a search and replace, the distinction becomes paramount.

    Here’s a (hopefully simple) quiz: where does fat cat belly your match in the string The dragging belly indicates your cat is too fat’?  Turn the page to check your answer.

    The “transmission” and the bump-along

    It might help to think of this rule as the car’s transmission, connecting the engine to the drive train while adjusting for the gear you’re in. The engine itself does the real work (turning the crank); the transmission transfers this work to the wheels.

    The transmission’s main work: the bump-along

    If the engine can’t find a match starting at the beginning of the string, it’s the transmission that bumps the regex engine along to attempt a match at the next position in the string, and the next, and the next, and so on. Usually. For instance, if a regex begins with a start-of-string anchor, the transmission can realize that any bump-along would be futile, for only the attempt at the start of the string could possibly be successful. This and other internal optimizations are discussed in Chapter 6.

    Engine Pieces and Parts

    An engine is made up of parts of various types and sizes. You can’t possibly hope to truly understand how the whole thing works if you don’t know much about the individual parts. In a regex, these parts are the individual units—literal characters, quantiers (star and friends), character classes, parentheses, and so on, as described in Chapter 3 ( 114). The combination of these parts (and the engine’s treatment of them) makes a regex what it is, so looking at ways they can be combined and how they interact is our primary interest. First, let’s take a look at some of the individual parts:

    Literal text (e.g.,  a  \*  !  M ... )

    With a literal, non-metacharacter like z or !, the match attempt is simply "Does this literal character match the current text character?” If your regex is only literal text, such as usa, it is treated as “u and then s and then a. ” It’s a bit more complicated if you have the engine do a case-insensitive match, where b matches B and vice-versa, but it’s still pretty straightforward. (With Unicode, there are a few additional twists  110.)

    Character classes, dot, Unicode properties, and the like

    Matching dot, character classes, Unicode properties, and the like ( 118) is usually a simple matter: regardless of the length of the character class, it still matches just one character.

    Dot is just a shorthand for a large character class that matches almost any character ( 111), so its actions are simple, as are the other shorthand conveniences such as \w, \W, and \d.

    Capturing parentheses

    Parentheses used only for capturing text (as opposed to those used for grouping) don’t change how the match is carried out.


    Quiz Answer

    Answer to the question on page 148.

    Remember, the regex is tried completely each time, so fat|cat|belly|your matches The dragging belly indicates your cat is too fat rather than fat, even though fat is listed first among the alternatives.

    Sure, the regex could conceivably match fat and the other alternatives, but since they are not the earliest possible match (the match starting furthest to the left), they are not the one chosen. The entire regex is attempted completely from one spot before moving along the string to try again from the next spot, and in this case that means trying each alternative fat,cat, belly, and your at each position before moving on.


    Anchor's (e.g., ˆ \Z (?<=\d) ... )

     

    There are two basic types of anchors: simple ones (ˆ, $, \G, \b,  ...   129) and complex ones (lookahead and lookbehind   133). The simple ones are indeed simple in that they test either the quality of a particular location in the target string (ˆ, \Z,  ...), or compare two adjacent characters (\<, \b</b, ...). On the other hand, the lookaround constructs can contain arbitrary sub-expressions, and so can be arbitrarily complex.

    No “electric” parentheses, backreferences, or lazy quantifiers

    I’d like to concentrate here on the similarities among the engines, but as foreshadowing of what’s to come in this chapter, I’ll point out a few interesting
    differences. Capturing parentheses (and the associated backreferences and $1 type functionality) are like a gas additive—they affect a gasoline (NFA) engine, but are irrelevant to an electric (DFA) engine. The same thing applies to lazy quantifiers. The way a DFA engine works completely precludes these concepts.This explains why tools developed with DFAs don’t provide these features. You’ll notice that awk, lex, and egrep don’t have backreferences or any $1 type functionality.

    You might, however, notice that GNU’s version of egrep does support backreferences. It does so by having two complete engines under the hood! It first uses a DFA engine to see whether a match is likely, and then uses an NFA engine (which supports the full flavor, including backreferences) to confirm the match. Later in this chapter, we’ll see why a DFA engine can’t deal with backreferences or capturing, and why anyone ever would want to use such an engine at all. (It has some major advantages, such as being able to match very quickly.)

    More Java Articles
    More By O'Reilly Media


       · This article is an excerpt from the book "Mastering Regular Expressions, Third...
     

    Buy this book now. This article is excerpted from chapter four of the book Mastering Regular Expressions, Third Edition, written by Jeffrey E.F. Friedl (O'Reilly, 2006; ISBN: 0596528124). Check it out today at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...







    © 2003-2009 by Developer Shed. All rights reserved. DS Cluster 5 Hosted by Hostway
    For more Enterprise Application Development news, visit eWeek