
Regular Expressions


Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 28, 2005
TABLE OF CONTENTS:
  1. Regular Expressions
  2. Creating Patterns
  3. Common and Boundary Characters
  4. Character Classes
  5. Back References
  6. Integrating Java with Regular Expressions
  7. Confirming Name Formats Example
  8. Finding Duplicate Words Example
  9. Regular Expression Operations
  10. Search and Replace
  11. Comparing Regex and Perl


“Everything, you see, makes sense, if you take the trouble to work out the rationale.”

— Piers Anthony

REGULAR EXPRESSIONS, or regex for short, describe text. They are a mechanism by which you can tell the Java Virtual Machine (JVM) how to find and potentially manipulate text for you. In this chapter, I’ll examine and contrast the traditional approach of describing text with the regex approach.

For example, imagine you need to validate e-mail addresses. The verbal directions for doing so might be something along the lines of “Make sure the e-mail address contains an at (@) symbol.” You could probably handle this task with a single line of Java code:

if (email.indexOf("@") > 0) {
    return true;
}

So far, so good. Suppose additional requirements creep in, though, as they invariably do. Now you also need to make sure that all e-mail addresses end with the .org extension. So you amend your code as follows:

if ((email.indexOf("@") > 0) && (email.endsWith(".org"))) {
    return true;
}

But the requirements continue to creep. You now need all e-mail addresses to be of the form firstname_lastname, so you use the StringTokenizer to tokenize the e-mail address, extract the part before the @, look for the underscore (_) character, tokenize the strings around that, and so on. Pretty soon, you have some convoluted code for what should be a fairly straightforward operation.
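To make the contrast concrete, here is a minimal sketch of what such hand-written validation might look like. It is an illustration only: the class name ManualEmailCheck, the helper isLetters, and the sample addresses are hypothetical and do not come from the book.

```java
import java.util.StringTokenizer;

class ManualEmailCheck {

    // True if s is non-empty and contains only the ASCII letters A-Z and a-z.
    static boolean isLetters(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (char c : s.toCharArray()) {
            boolean upper = c >= 'A' && c <= 'Z';
            boolean lower = c >= 'a' && c <= 'z';
            if (!upper && !lower) {
                return false;
            }
        }
        return true;
    }

    static boolean isValidEmail(String email) {
        // Requirements 1 and 2: an @ that is not the first character, and a .org ending.
        if (email.indexOf("@") <= 0 || !email.endsWith(".org")) {
            return false;
        }

        // Tokenize around the @ and keep the part before it.
        StringTokenizer atParts = new StringTokenizer(email, "@");
        if (atParts.countTokens() != 2) {
            return false;
        }
        String localPart = atParts.nextToken();

        // Requirement 3: tokenize around the underscore and expect two names.
        StringTokenizer nameParts = new StringTokenizer(localPart, "_");
        if (nameParts.countTokens() != 2) {
            return false;
        }
        return isLetters(nameParts.nextToken()) && isLetters(nameParts.nextToken());
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("jane_doe@example.org"));
        System.out.println(isValidEmail("janedoe@example.org"));
    }
}
```

Even at this length the sketch has gaps. It never inspects the domain name, and because StringTokenizer skips empty tokens, an address such as jane__doe@example.org would slip through.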

The use of regular expressions can greatly simplify and condense this process. With regular expressions, you could write the following:

String regex = "[A-Za-z]+_[A-Za-z]+@[A-Za-z]+\\.org";
if (email.matches(regex)) return true;

In English, this means “Look for one or more letters, followed by an _, followed by one or more letters, followed by an @, followed by one or more letters, followed by .org.” Notice that a literal period precedes “org”. It is written as \\. in the Java string because an unescaped period in a regex matches any character.
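The following sketch runs a few addresses through that expression to show which parts of the description each one satisfies. The class name EmailFormatCheck, the method isValidEmail, and the sample addresses are made up for illustration.

```java
class EmailFormatCheck {

    // The expression from the text: letters, underscore, letters, @, letters, ".org".
    static final String REGEX = "[A-Za-z]+_[A-Za-z]+@[A-Za-z]+\\.org";

    static boolean isValidEmail(String email) {
        // String.matches succeeds only if the entire string fits the pattern.
        return email.matches(REGEX);
    }

    public static void main(String[] args) {
        String[] samples = {
            "jane_doe@example.org",  // fits the description
            "janedoe@example.org",   // no underscore before the @
            "jane_doe@example.com",  // does not end in .org
            "jane_doe@exampleXorg"   // the escaped period must be a literal '.'
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValidEmail(sample));
        }
    }
}
```

Only the first sample fits the whole description, so it should be the only one reported as true.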

Don’t be concerned if the syntax isn’t completely clear to you right now—making it clear is the aim of this book. This chapter explores the underlying concepts of Java regex, with an emphasis on actually forming and using the regex syntax. It’s a complete introduction to regular expressions, and it also serves as a preamble to the next chapter. Chapter 2, in turn, is a complete and exhaustive documentation of the J2SE regex object model.

The Building Blocks of Regular Expressions

Regular expressions in Java 2 Standard Edition (J2SE) consist of two essential parts, which are embodied by two new Java objects. The first part is a Pattern, and the second is a Matcher. Understanding these two objects is crucial to your ability to master regular expressions. Fortunately, they’re easy concepts to understand.

I define these concepts in detail in the sections that follow, but at a general level, a pattern describes what you’re searching for, and a matcher examines candidates that might match the pattern or description. For example, \s+ is a pattern describing one or more whitespace characters. Correspondingly, J2SE now provides the Pattern and Matcher objects.
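As a rough sketch of that division of labor, the snippet below compiles \s+ into a Pattern and asks a Matcher how many runs of whitespace a candidate contains. The class name WhitespaceRuns, the method countRuns, and the candidate string are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceRuns {

    // Counts how many separate runs of whitespace the candidate contains.
    static int countRuns(String candidate) {
        // The pattern is the description: one or more whitespace characters.
        Pattern p = Pattern.compile("\\s+");
        // The matcher applies that description to one particular candidate.
        Matcher m = p.matcher(candidate);
        int runs = 0;
        while (m.find()) {
            runs++;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two spaces, then a space-tab-space sequence: two runs in total.
        System.out.println(countRuns("one  two \t three"));
    }
}
```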

NOTE  When I refer to a candidate or a candidate string, I mean the string that the regex will be acting on. Thus, for the pattern described in the preceding section, a candidate string might be coach@influxs.com, john_john_smith@w3c.org, or hana@saez.com.

Defining Patterns

Patterns are the actual descriptions used in regular expressions. Their power stems from their capability to describe text, as opposed to specifying it. They’re an important part of the regex vernacular, and you need to understand them well to use regular expressions. Fortunately, they’re easy to grasp if you refuse to be intimidated, and their somewhat off-putting syntax soon becomes intuitive.

A pattern allows you to describe the characteristics of the item you’re looking for, without specifying the item explicitly. This can be especially helpful when you only know the traits of your targets, but you’re unable to name them specifically.

Imagine parsing a document. You might want to find every capitalized word; or every word beginning with the letter Z; or every word beginning with a capital Z, followed by a vowel, unless that vowel is an a. You can’t know beforehand exactly what those words will be for a given document, but you can describe them. That description is your pattern.
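Those three descriptions could be written as patterns along the following lines. This is a sketch under stated assumptions: the class name WordPatterns, the helper findAll, and the sample sentence are invented here, “the letter Z” is read as a capital Z, and only lowercase vowels are considered after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatterns {

    // Every capitalized word.
    static final String CAPITALIZED = "\\b[A-Z]\\w*\\b";
    // Every word beginning with a capital Z (assumed reading of "the letter Z").
    static final String STARTS_WITH_Z = "\\bZ\\w*\\b";
    // A capital Z followed by a lowercase vowel other than a.
    static final String Z_THEN_VOWEL_NOT_A = "\\bZ[eiou]\\w*\\b";

    // Collects every substring of text that fits the given description.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Zoe and Zack saw a Zebra in Berlin.";
        System.out.println(findAll(CAPITALIZED, text));
        System.out.println(findAll(STARTS_WITH_Z, text));
        System.out.println(findAll(Z_THEN_VOWEL_NOT_A, text));
    }
}
```

None of the patterns names Zoe, Zack, or Zebra. Each one only states traits, and the words that show up depend entirely on the document being searched.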

I think of regular expressions as a police station. A pattern is the officer who takes a description of the suspects, and a matcher is the officer that rounds up and interrogates those suspects.

Defining Matchers

If you’re familiar with Structured Query Language (SQL), it might help you to think of regular expressions as a sort of SQL for examining free-flowing text. A pattern is conceptually similar to the SQL query that’s executed. A matcher corresponds to the ResultSet returned by that query.

A Matcher examines the results of applying a Pattern. If your pattern said, “Find every word starting with a in the previous sentence,” then you would examine the Matcher after applying your pattern. Your code might look like Listing 1-1. The output for Listing 1-1 is shown in Output 1-1, which follows the listing.

Listing 1-1. Finding Every Occurrence of the Letter A

import java.util.regex.*;

public class FindA {
  public static void main(String[] args) throws Exception {
    String candidate =
        "A Matcher examines the results of applying a pattern.";

    // define the matching pattern as a word boundary, a lowercase a,
    // any number of immediately trailing letters, numbers, or
    // underscores, followed by a word boundary
    String regex = "\\ba\\w*\\b";
    Pattern p = Pattern.compile(regex);

    // extract the Matcher for the candidate string
    Matcher m = p.matcher(candidate);

    String val = null;

    // display the original input string
    System.out.println("INPUT: " + candidate);

    // display the search pattern
    System.out.println("REGEX: " + regex + "\r\n");

    // examine the Matcher, and extract all
    // words starting with a lowercase a
    while (m.find()) {
      val = m.group();
      System.out.println("MATCH: " + val);
    }

    // if there were no matches, say so
    if (val == null) {
      System.out.println("NO MATCHES: ");
    }
  }
}

Output 1-1. Result of Running FindA

-----------------------------------------------------------------------------
INPUT: A Matcher examines the results of applying a pattern.
REGEX: \ba\w*\b

MATCH: applying
MATCH: a

Again, it’s not necessary that you be able to follow the code given in detail right now. I just want to establish a general sense of how things are done in J2SE regex. First, I define my Pattern:

Pattern p = Pattern.compile(regex);

Then, I feed my candidate string to the Pattern and extract a Matcher:

Matcher m = p.matcher(candidate);

Finally, I interrogate my Matcher:

while (m.find()) {….}

