Advanced Regex


Have you reached the point in your studies of J2SE that you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of concepts, and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 07, 2005
TABLE OF CONTENTS:
  1. · Advanced Regex
  2. · Noncapturing Subgroups
  3. · Greedy Qualifiers
  4. · Reluctant Qualifiers
  5. · Understanding Lookarounds
  6. · Zen and the Art of Efficient Expressions
  7. · Summary


Advanced Regex - Understanding Lookarounds
(Page 5 of 7 )

There are times in programming, as in life, when you’d like to know what to expect before making a more serious effort. For example, you might want to know that your favorite restaurant is open before you go there to eat. How would you accomplish that? You would, of course, phone ahead. The same idea is used in regex lookarounds.

There are four flavors of lookarounds: positive lookaheads, negative lookaheads, positive lookbehinds, and negative lookbehinds. The following sections explain each in detail.

NOTE   Lookarounds are noncapturing groups, and they never consume text. Thus, verifying that a certain character exists further down the candidate string doesn't mean that the character in question has been exhausted by the regex pattern. Lookarounds don't match characters; they match positions.

Positive Lookaheads

Positive lookaheads allow your regex to “peek ahead” and make sure that the pattern does, in fact, exist somewhere down the line in your candidate string before the rest of the match is attempted. They don’t consume that text, however—they just confirm the truth of its existence. They are, basically, a way to tell the regex engine “Don’t bother looking at the candidate string if it doesn’t have the lookahead.” You form them by opening the group with the characters (?=. For example, the lookahead

(?=\d\d)

confirms that the next two characters, starting at the position currently being examined, are both digits. However, it doesn't consume those two digits. Combined with other regex patterns, positive lookaheads can be a very powerful weapon in your regex arsenal.
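To see that a lookahead really matches a position rather than characters, you can run (?=\d\d) on its own and print where it succeeds. The following is only a minimal sketch; the class name, the method name, and the sample string a12b345 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroWidthLookaheadSketch {

    // Returns every index at which the next two characters are both digits.
    static List<Integer> positionsBeforeTwoDigits(String candidate) {
        Matcher matcher = Pattern.compile("(?=\\d\\d)").matcher(candidate);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            // A lookahead on its own matches the empty string,
            // so start() and end() are the same index.
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Illustrative candidate, not taken from the listings
        System.out.println(positionsBeforeTwoDigits("a12b345")); // [1, 4, 5]
    }
}
```

Each reported index is the position just before a pair of digits. Nothing is consumed, which is why the overlapping pairs 34 and 45 are both reported.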

Say you want to match IP addresses, but only if they begin with 255. Also, if they do begin with 255, you want the entire address returned. With lookaheads, this issue is easily solved, as demonstrated in Listing 3-8. Of course, this example assumes a great deal about the friendly nature of the data. Even so, it does nicely illustrate the usage of lookaheads, so all is forgiven. Table 3-2, which follows Listing 3-8, deconstructs the regex pattern (?=^255).*.

Listing 3-8. Simple Positive Lookahead Example

import java.util.regex.*;

public class PositiveLookaheadExample{
  public static void main(String args[]){
    //define the pattern
    String regex = "(?=^255).*";

    //compile the pattern
    Pattern pattern = Pattern.compile(regex);

    //define the candidate string
    String candidate = "255.0.0.1";

    //extract a matcher for the candidate string
    Matcher matcher = pattern.matcher(candidate);

    String ip = "not found";

    //if the candidate starts with 255, then the ip
    //will be populated with the correct information.
    if (matcher.find())
      ip = matcher.group();

    String msg = "ip: " + ip;
    System.out.println(msg);
  }
}

Table 3-2. The Pattern (?=^255).*

Regex    Description
(?=      A positive lookahead consisting of
^        The beginning of the line, followed by
2        The character 2 followed by
5        The character 5 followed by
5        The character 5
)        Close the lookahead group
.        Any character
*        Repeated zero or more times

In Listing 3-8, the regex engine first confirms that the candidate string starts with 255 before attempting to execute the rest of the pattern. If the candidate String doesn’t do so, then the rest of the pattern can’t possibly match and no resources are wasted in attempting to do so.

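Notice that a noncapturing group, (?:^255), isn't a substitute for the lookahead (?=^255). The noncapturing group consumes the characters 255, even though it doesn't capture them, so the .* that follows it is left with only the .0.0.1 that comes after them. The lookahead consumes nothing, so .* starts at the beginning of the candidate and matches the entire address. In Listing 3-8 itself, matcher.group() would return 255.0.0.1 with either form, because group 0 spans everything the pattern consumed; the difference becomes visible as soon as .* is wrapped in its own capturing group.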
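The difference between consuming and merely peeking is easiest to see if you wrap .* in a capturing group and compare what it holds under each form. The sketch below assumes the noncapturing form is written (?:^255); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadVersusGroupSketch {

    // Returns the text held by capturing group 1 after find(), or null if nothing matched.
    static String tail(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String candidate = "255.0.0.1";

        // The lookahead consumes nothing, so (.*) starts at index 0.
        System.out.println(tail("(?=^255)(.*)", candidate)); // 255.0.0.1

        // The noncapturing group consumes 255, so (.*) starts after it.
        System.out.println(tail("(?:^255)(.*)", candidate)); // .0.0.1
    }
}
```

Both patterns reject a candidate that doesn't start with 255; they differ only in where the rest of the pattern resumes.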

Negative Lookaheads

Negative lookaheads, like positive lookaheads, allow your regex to “peek ahead.” However, they allow the engine to confirm that something does not exist somewhere down the line in your candidate string. Like all lookaheads, they don’t consume text; they just confirm the truth of its absence. They’re formed by opening the group with the characters (?!. For example:

(?!\d\d)

confirms that the next two characters, starting at the position currently being examined, are not both digits. Like every lookahead, it consumes nothing.
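As a quick illustration, the following sketch places (?!\d\d) right after a literal # so that a tag is matched only when it doesn't begin with two digits. The # tags, the class name, and the method name are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegativeLookaheadDigitsSketch {

    // Collects every #tag whose first two characters after the # are not both digits.
    static List<String> tagsNotStartingWithTwoDigits(String candidate) {
        Matcher matcher = Pattern.compile("#(?!\\d\\d)\\w+").matcher(candidate);
        List<String> tags = new ArrayList<>();
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // #42a is rejected by the lookahead; #a42 and #4b pass it.
        System.out.println(tagsNotStartingWithTwoDigits("#42a #a42 #4b")); // [#a42, #4b]
    }
}
```

Because the lookahead consumes nothing, the \w+ that follows it still gets to match the very characters the lookahead just inspected.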

Say you’re parsing text and you want to find references to John and extract both the first name and the last name, unless that reference happens to be John Smith. With negative lookaheads, this sort of exercise becomes very easy. Listing 3-9 demonstrates the code for doing so. Table 3-3 deconstructs the regex pattern used.

Listing 3-9. Simple Negative Lookahead Example

import java.util.regex.*;

public class NegativeLookaheadExample{
  public static void main(String args[])
  throws Exception
  {
    //define the pattern
    String regex = "John (?!Smith)[A-Z]\\w+";

    //compile the pattern
    Pattern pattern = Pattern.compile(regex);

    String candidate = "I think that John Smith ";
    candidate += "is a fictional character. His real name ";
    candidate += "might be John Jackson, John Westling, ";
    candidate += "or John Holmes for all we know.";

    //extract a matcher for the candidate string
    Matcher matcher = pattern.matcher(candidate);

    String tmp = null;

    //extract the matching group. Notice that it's
    //the default group, since lookarounds are
    //noncapturing
    while (matcher.find()){
      tmp = matcher.group();
      System.out.println("MATCH:" + tmp);
    }
  }
}

Table 3-3. The Pattern John (?!Smith)[A-Z]\w+

Regex      Description
J          The character J followed by
o          The character o followed by
h          The character h followed by
n          The character n followed by
<space>    A space, followed by
(?!        A position in which you’ll find anything but
S          The character S followed by
m          The character m followed by
i          The character i followed by
t          The character t followed by
h          The character h
)          Close the lookahead group, followed by
[A-Z]      Any uppercase character followed by
\w         A word character
+          Repeated one or more times

In Listing 3-9, the regex engine parses the candidate and considers successful matches to be those that consist of John followed by some capitalized word, unless that capitalized word is Smith. Again, it’s important to notice that the lookahead consumes nothing: once it has confirmed that Smith doesn’t come next, the last name is still there for [A-Z]\w+ to match, so the entire name ends up in the match. For the candidate shown, the matches are John Jackson, John Westling, and John Holmes.

Positive Lookbehinds

So far, you’ve explored the ability to look to the right of the candidate String to “peek ahead” and see what the future has in store for your pattern. Similarly, there are times when it’s useful to be able to look to the left of the current position being considered to see what the past had to say about a particular pattern. That is the purpose of lookbehinds.

Like lookaheads, lookbehinds come in two flavors. Positive lookbehinds confirm the existence of a pattern to the left of the current position, and negative lookbehinds confirm the absence of a pattern to the left of the current position. You form positive lookbehinds by opening a noncapturing group with (?<=. Thus, to confirm that two digits precede the current position, you might use the following positive lookbehind:

(?<=\d\d).*

This confirms that the current position is preceded by two digits in a row. It doesn’t consume those two digits; however, it acts as if it did, because they’re left out of the match. This happens because the expression parser has already moved past them. That is, the parser has, by definition, already tried to start a match at each of those characters and failed to do so. If it hadn’t failed, the match would have started there instead.

Consider the candidate 42 is the answer. When the regex engine compares this candidate String against the pattern (?<=\d\d).*, it starts by examining the first character, which is 4. Because two digits don’t precede 4, it’s rejected. Next, the engine compares the 2 character. Because 2 is also not preceded by two digits, it is discarded. Next, the regex engine examines the space character following 2 in the candidate string 42 is the answer. Because that space character is, in fact, preceded by two digits, namely 4 and 2, the regex engine happily starts to match. Of course, because the remaining part of the pattern is .*, every remaining character is matched. Thus, the space following 42 and everything thereafter is captured. But 4 and 2 aren’t captured, because the regex engine already passed them.

Because the regex engine is already past the 4 and the 2 characters, it won’t match them. This is an important and subtle distinction. Lookbehinds, like all lookarounds, are noncapturing. However, in this case, they appear to act as if they’ve already captured the 4 and the 2 characters. That is, the characters 4 and 2 are excluded from the capture set. However, that’s because they’ve already been parsed, not because they’ve been captured. It’s important to be able to see through this illusion.
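You can check the walkthrough above by asking the matcher where the match begins. This is a minimal sketch; the class name, the method name, and the start:[text] output format are chosen purely for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindWalkthroughSketch {

    // Returns "start:[text]" for the first match of (?<=\d\d).*, or "no match".
    static String describe(String candidate) {
        Matcher matcher = Pattern.compile("(?<=\\d\\d).*").matcher(candidate);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.start() + ":[" + matcher.group() + "]";
    }

    public static void main(String[] args) {
        // Indexes 0 and 1 are rejected; index 2 is the first position
        // that has two digits immediately to its left.
        System.out.println(describe("42 is the answer")); // 2:[ is the answer]
    }
}
```

The match starts at index 2, the space, and the 4 and the 2 are absent from it even though the pattern never consumed them.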

Listing 3-10 demonstrates some code for using positive lookbehinds. The goal is to parse a document’s content and extract any URLs used. Table 3-4 deconstructs the regex pattern used.

Listing 3-10. Simple Positive Lookbehind Example

import java.util.regex.*;

public class PositiveLookBehindExample{
  public static void main(String args[])
  throws Exception
  {
    //define the pattern
    String regex = "(?<=http://)\\S+";

    //compile the pattern
    Pattern pattern = Pattern.compile(regex);

    String candidate = "The Apress website can be found at ";
    candidate += "http://www.apress.com. There, ";
    candidate += "you can find information about some of ";
    candidate += "best books in the industry, including the ";
    candidate += " bestselling Sun Certified Java Developer ";
    candidate += " Exam with J2SE(";
    candidate += "http://www.apress.com/book/bookDisplay.";
    candidate += "html?bID=39) as well as others.";

    //extract a matcher for the candidate string
    Matcher matcher = pattern.matcher(candidate);

    //print each match, delimited by colons. \S+ runs to the
    //next whitespace, so trailing punctuation is included.
    while (matcher.find()){
      String msg = ":" + matcher.group() + ":";
      System.out.println(msg);
    }
  }
}

Table 3-4. The Pattern (?<=http://)\S+

Regex    Description
(?<=     Open a positive lookbehind group consisting of
h        The character h followed by
t        The character t followed by
t        The character t followed by
p        The character p followed by
:        The character : followed by
/        The character / followed by
/        The character /
)        Close the lookbehind group, followed by
\S       A nonspace character
+        Repeated one or more times

In English: Match the remainder of a URL, provided that it is immediately preceded by http://.

Negative Lookbehinds

Negative lookbehinds confirm the absence of a pattern to the left of the current position. They’re a way of telling the regex engine, “I’m interested in the candidate String, so long as it isn’t preceded by such and such.” You form negative lookbehinds by opening a noncapturing group with (?<!.

Negative lookbehinds aren’t as intuitive as the other lookarounds, so it’s worthwhile to explore how they actually work. For example, consider the following negative lookbehind:

(?<!\d\d).*

The preceding seems to request that the candidate string not be preceded by two digits in a row. However, when you actually test it against the String 42 is the answer, it matches the entire candidate. What’s going on here?

The problem is that the first element in the candidate 42 is the answer is 4. So the engine asks itself if the 4 character is preceded by two digits. Because the answer is no, the lookbehind succeeds at the very first position and the entire pattern is matched into group(0). Remember, .* is a greedy qualifier, so it matches as much as possible: in this case, the entire candidate string.
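The following sketch reproduces that result and, for contrast, shows a negative lookbehind attached to something more specific than .*. The second pattern, the sample text about prices, and the class and method names are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegativeLookbehindSketch {

    // Returns "start:[text]" for the first match of regex, or "no match".
    static String firstMatch(String regex, String candidate) {
        Matcher matcher = Pattern.compile(regex).matcher(candidate);
        if (!matcher.find()) {
            return "no match";
        }
        return matcher.start() + ":[" + matcher.group() + "]";
    }

    public static void main(String[] args) {
        // Nothing precedes index 0, so the lookbehind succeeds immediately
        // and the greedy .* takes the whole candidate.
        System.out.println(firstMatch("(?<!\\d\\d).*", "42 is the answer"));
        // prints 0:[42 is the answer]

        // Hypothetical contrast: a whole number that is not preceded by a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "costs $42 or 17"));
        // prints 13:[17]
    }
}
```

In the second pattern the lookbehind guards a concrete token, so it has a chance to fail at the positions you care about instead of succeeding trivially at the start of the string.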

