Introduction to the Java.util.regex Object Model
By: Apress Publishing
    2005-08-18

    Table of Contents:
  • Introduction to the Java.util.regex Object Model
  • public static Pattern compile(String regex, int flags) Throws a PatternSyntaxException
  • public String[] split(CharSequence input)
  • The Matcher Object
  • public int start(int group)
  • public int end(int group)
  • public String group(int group)
  • public boolean find()
  • public Matcher appendReplacement (StringBuffer sb, String replacement)
  • Special Notes
  • New String Regex-Friendly Methods
  • Summary


    If you have ever wanted to know all about the Pattern and Matcher classes of Java's new java.util.regex package, this article is an excellent place to start. It is taken from chapter 2 of the book Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

    “How do you eat an elephant? One bite at a time.”

    — Proverb

    Java’s new java.util.regex package offers an elegant and agile object model with which to meet regular expression needs. It is composed, in its entirety, of three classes: Pattern, Matcher, and PatternSyntaxException. This chapter details all of the methods and fields of the Pattern and Matcher classes, and provides examples of their use. In this chapter I also discuss the new regex-supporting methods retrofitted into the String class.

    The chapter starts with an examination of the Pattern class. I cover every method and field and often provide examples. I also explore the Matcher class in similar detail. Finally, I round out the chapter by discussing the regex methods that have been added to the String class.

    The Pattern Object

    Figure 2-1 shows the methods of the Pattern class. The figure is a UML rendering of the Pattern class that illustrates the various methods and constants of the class.


    Figure 2-1.  A UML representation of the Pattern class 

    Let’s examine the fields and methods of the Pattern class in detail. If you aren’t familiar with UML, here’s a quick guide to reading Figure 2-1:

    • The name of the class is Pattern, and it’s in the topmost section of the rectangle.

    • The middle section is a grouping of the field variables. The plus sign (+) preceding these indicates that they’re public. The underline indicates that they’re static, and the “: int”  indicates that they’re of type int. The “= num” indicates the default value.

    • The bottommost section of the rectangle holds the class’s methods. Again, the plus sign indicates public access, and the underline indicates that the method is static. The parentheses designate the parameters for a given method; thus, flags() takes no parameters, whereas matcher(input : CharSequence) takes a parameter named input of type CharSequence. The colon (:) toward the end indicates the method’s return type.

    The following sections describe the fields and methods of the Pattern class.

    public static final int UNIX_LINES

    The UNIX_LINES flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. Use this flag when you parse data that originates on a UNIX machine.

    On many flavors of UNIX, the invisible character \n marks the end of a line. Other operating systems, including flavors of Windows, may use \r\n, \n, \r, \u2028, or \u0085 as a line terminator.

    If you transport a file that originated on a UNIX machine to a Windows platform and open it, you may notice that the lines will sometimes not terminate in the expected manner, depending on which editor you use to view the text. This happens because the two systems can use different syntax to denote the end of a line.

    The UNIX_LINES flag simply tells the regex engine that it’s dealing with UNIX-style lines: only \n is recognized as a line terminator, which affects the matching behavior of the regular expression metacharacters ., ^, and $. Using the UNIX_LINES flag, or the equivalent embedded flag expression (?d), doesn’t degrade performance. By default, this flag isn’t set.
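
To make the effect of UNIX_LINES concrete, here is a minimal sketch that counts where ^ matches in an input containing both a bare \r and a \n. It combines UNIX_LINES with the MULTILINE flag (described later in this chapter) so that ^ can match at each line start. The class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Counts the positions at which ^ matches under the given flags.
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "one\rtwo\nthree";
        // Default line terminators: both \r and \n begin a new line.
        System.out.println(countLineStarts(input, Pattern.MULTILINE)); // 3
        // UNIX_LINES: only \n begins a new line, so the \r is ordinary text.
        System.out.println(
            countLineStarts(input, Pattern.MULTILINE | Pattern.UNIX_LINES)); // 2
    }
}
```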

    public static final int CASE_INSENSITIVE

    The CASE_INSENSITIVE field is used when constructing the second parameter of the Pattern.compile(String regex, int flags) method. It’s useful when you need to match ASCII characters, regardless of case.

    Using this flag or the equivalent embedded flag expression (?i) can cause performance to degrade slightly. By default, this flag isn’t set.
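
As a rough illustration, the sketch below matches the same input with and without case-insensitive matching, using both the flag and the embedded (?i) form. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Matches the whole input against the regex using the CASE_INSENSITIVE flag.
    static boolean matchesWithFlag(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input)
                      .matches();
    }

    // Same idea, but the flag is embedded in the expression itself as (?i).
    static boolean matchesWithEmbeddedFlag(String regex, String input) {
        return Pattern.compile("(?i)" + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("java", "JAVA"));         // false
        System.out.println(matchesWithFlag("java", "JAVA"));         // true
        System.out.println(matchesWithEmbeddedFlag("java", "JaVa")); // true
    }
}
```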

    public static final int COMMENTS

    The COMMENTS flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the regex pattern may contain whitespace and embedded comments that aren’t part of the expression. Specifically, the engine ignores whitespace in the pattern, and it ignores everything from a # character until the end of the line.

    Thus, the regex pattern A   #matches uppercase US-ASCII char code 65 will use A as the regular expression, but the spaces leading to the # character and everything after it until the end of the line will be ignored. Your code might end up looking something like this:

    Pattern p = Pattern.compile(
        "A   #matches uppercase US-ASCII char code 65", Pattern.COMMENTS);

    Think of the # character as the regex equivalent of the Java comment //. By using the COMMENTS flag in compiling your regex, you’re telling the regex engine that your expression contains comments, which should be ignored. This can be useful if your pattern is particularly complex or subtle. When you don’t set this flag, the regex engine will attempt to interpret and use your comments as part of the regular expression.

    Using this flag or the equivalent (?x) regular expression doesn’t degrade performance.
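
The flag becomes more useful when a pattern is spread over several lines, each with its own comment. The following sketch assumes a made-up format (three digits, a dash, four digits) purely for illustration; the class, field, and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CommentsDemo {
    // Each piece of the pattern ends with \n so that the # comment stops there.
    static final Pattern CODE = Pattern.compile(
          "[0-9]{3}   # three-digit prefix\n"
        + "-          # a literal dash\n"
        + "[0-9]{4}   # four-digit suffix",
        Pattern.COMMENTS);

    static boolean isCode(String input) {
        return CODE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCode("555-1234")); // true
        System.out.println(isCode("555 1234")); // false
    }
}
```

Because whitespace in the pattern is ignored in this mode, the padding before each # does not change what the pattern matches.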

    public static final int MULTILINE

    The MULTILINE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the input isn’t a single line of text; rather, it contains several lines, each with its own termination characters.

    This means that the beginning-of-line character, ^, and the end-of-line character, $, will potentially match several lines within the input String. For example, imagine that your input String is This is a sentence. \n  So is this.. If you use the MULTILINE flag to compile the regular expression pattern

    Pattern p = Pattern.compile("^", Pattern.MULTILINE);

    then the beginning-of-line character, ^, will match before the T in This is a sentence. It will also match at the start of the second line, So is this. When you don’t use the MULTILINE flag, ^ matches only once, at the very beginning of the input, before the T in This is a sentence.

    Using this flag or the equivalent (?m) regular expression may degrade performance.
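
A small sketch of the behavior just described: it records the offsets at which ^ matches, with and without MULTILINE. The input is a simplified version of the sentence pair above (no spaces after the \n), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Returns every offset in the input at which ^ matches under the given flags.
    static List<Integer> lineStartOffsets(String input, int flags) {
        Matcher m = Pattern.compile("^", flags).matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "This is a sentence.\nSo is this.";
        System.out.println(lineStartOffsets(input, 0));                 // [0]
        System.out.println(lineStartOffsets(input, Pattern.MULTILINE)); // [0, 20]
    }
}
```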

    public static final int DOTALL

    The DOTALL flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. The DOTALL flag tells the regex engine to allow the metacharacter period to match any character, including line termination characters. What does this mean?

    Imagine that your candidate String is Test\n. If your corresponding regex pattern is a single period (.), you would normally have four matches: one for the T, another for the e, another for the s, and the fourth for the t. This is because the regex metacharacter . will normally match any character except a line termination character.

    Enabling the DOTALL flag as follows:

    Pattern p = Pattern.compile(".", Pattern.DOTALL);

    would generate five matches. Your pattern would match the T, e, s, and t characters. In addition, it would match the \n character at the end of the line.

    Using this flag or the equivalent (?s) regular expression doesn’t degrade performance.
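
The four-versus-five count can be checked with a short sketch like the one below. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Counts how many times the single-period pattern matches in the input.
    static int countDotMatches(String input, int flags) {
        Matcher m = Pattern.compile(".", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDotMatches("Test\n", 0));              // 4
        System.out.println(countDotMatches("Test\n", Pattern.DOTALL)); // 5
    }
}
```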

    public static final int UNICODE_CASE

    The UNICODE_CASE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It is used in conjunction with the CASE_INSENSITIVE flag to generate case-insensitive matches for international character sets, that is, for characters outside US-ASCII.

    Using this flag or the equivalent (?u) regular expression can degrade performance.
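
To see why the two flags are combined, consider the following sketch, which tries to match a lowercase é (written as \u00e9) against an uppercase É (\u00c9). The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Matches the whole input against the regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // e with acute accent, lowercase
        String upper = "\u00c9"; // E with acute accent, uppercase

        // CASE_INSENSITIVE alone only folds case for US-ASCII characters.
        System.out.println(matches(lower, upper, Pattern.CASE_INSENSITIVE)); // false

        // Adding UNICODE_CASE extends case folding to non-ASCII characters.
        System.out.println(matches(lower, upper,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)); // true
    }
}
```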

    public static final int CANON_EQ

    The CANON_EQ flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. As you know, characters are actually stored as numbers. For example, in the ASCII character set, the character A is represented by the number 65. Depending on the character set that you’re using, the same character can be represented by different numeric combinations. For example, à can be represented both by the single code point U+00E0 and by the sequence U+0061 U+0300 (the letter a followed by a combining grave accent). A CANON_EQ match would match either representation.

    Using this flag may degrade performance.
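
The sketch below illustrates canonical equivalence with the à example: the pattern is written in the decomposed form (a followed by U+0300) and the input uses the precomposed character U+00E0. The class and method names are hypothetical, and the sketch checks only this one direction.

```java
import java.util.regex.Pattern;

class CanonEqDemo {
    // Matches the whole input against the regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300";  // letter a + combining grave accent
        String precomposed = "\u00e0";  // single code point for a-grave

        // Without the flag, the two forms are different character sequences.
        System.out.println(matches(decomposed, precomposed, 0)); // false

        // With CANON_EQ, canonically equivalent forms are treated as a match.
        System.out.println(matches(decomposed, precomposed, Pattern.CANON_EQ)); // true
    }
}
```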

    public static Pattern compile(String regex) Throws a PatternSyntaxException

    You’ll notice that the Pattern class doesn’t have a public constructor. This means that you can’t write the following type of code:

    Pattern p = new Pattern("my regex");//wrong!

    To get a reference to a Pattern object, you must use the static method compile(String regex). Thus, your first line of regex code might look like the following:

    Pattern p = Pattern.compile("my regex");//Right!

    The parameter for this method is a String that represents a regular expression. When you pass a String to a method that expects a regular expression, it’s important to escape (delimit) any \ characters that the regular expression might have by placing another \ character in front of them. This is because Java String literals already use the \ character to introduce escape sequences such as \n, regardless of whether the String holds a regular expression. This was true long before regular expressions were part of Java. Thus, the regular expression \d becomes \\d. To match a single digit, your regular expression becomes the following:

    Pattern p = Pattern.compile("\\d");

    The point here being that the regular expression \d becomes the String \\d.

    The delimitation of the String parameter can sometimes be tricky, so it’s important to understand it well. By and large, it means that you double the \ characters that might already be present in the regular expression. It does not mean that you simply append a single \ character. I present an example to illustrate this shortly.
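
In the meantime, a brief sketch of the doubling rule may help. It shows that the String literal "\\d" holds the two-character regex \d, and that matching a single literal backslash takes four backslashes in source code. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The literal "\\d" is the two-character String \d, the regex for one digit.
    static boolean isSingleDigit(String input) {
        return Pattern.matches("\\d", input);
    }

    // The regex \\ matches one backslash; as a String literal it is "\\\\".
    static boolean isSingleBackslash(String input) {
        return Pattern.matches("\\\\", input);
    }

    public static void main(String[] args) {
        System.out.println("\\d".length());          // 2
        System.out.println(isSingleDigit("7"));      // true
        System.out.println(isSingleBackslash("\\")); // true
    }
}
```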

    The compile method will throw a java.util.regex.PatternSyntaxException if the regular expression itself is badly formed. For example, if you pass in a String that contains [4, the compile method will throw a PatternSyntaxException at runtime, because the syntax of the regular expression [4 is illegal, as shown in Listing 2-1.

    Listing 2-1. Using the compile Method

    import java.util.regex.*;

    public class DelimitTest {
      public static void main(String[] args) {
        // Throws PatternSyntaxException: the character class [4 is never closed
        Pattern p = Pattern.compile("[4");
      }
    }

    Does this mean that you have to catch a PatternSyntaxException every time you use a regular expression? No. PatternSyntaxException doesn’t have to be explicitly caught, because it is an unchecked exception: it extends IllegalArgumentException, which in turn extends RuntimeException, and a RuntimeException doesn’t need to be explicitly caught.
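
Catching the exception is still worthwhile when the regular expression comes from outside your program, such as user input. The following is a minimal sketch of that idea; the class and method names are hypothetical, while getDescription(), getIndex(), and getPattern() are accessors on PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexValidator {
    // Returns true if the regex compiles; otherwise reports why and returns false.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.err.println("Bad regex \"" + e.getPattern() + "\": "
                + e.getDescription() + " (near index " + e.getIndex() + ")");
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[4]")); // true
        System.out.println(isValidRegex("[4"));  // false
    }
}
```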

    The compile(String regex) method returns a Pattern object.
