Introduction to the Java.util.regex Object Model


If you have ever wanted to know all about the Pattern and Matcher classes of Java's new java.util.regex package, this article is an excellent place to start. It is taken from chapter 2 of the book Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
August 18, 2005
TABLE OF CONTENTS:
  1. · Introduction to the Java.util.regex Object Model
  2. · public static Pattern compile(String regex, int flags) Throws a PatternSyntaxException
  3. · public String[] split(CharSequence input)
  4. · The Matcher Object
  5. · public int start(int group)
  6. · public int end(int group)
  7. · public String group(int group)
  8. · public boolean find()
  9. · public Matcher appendReplacement (StringBuffer sb, String replacement)
  10. · Special Notes
  11. · New String Regex-Friendly Methods
  12. · Summary


"How do you eat an elephant? One bite at a time."

- Proverb

JAVA'S NEW java.util.regex package offers an elegant and agile object model with which to meet regular expression needs. It is composed, in its entirety, of three classes: Pattern, Matcher, and PatternSyntaxException. This chapter details all of the methods and fields of the Pattern and Matcher classes, and provides examples of their use. In this chapter I also discuss the new regular expression support methods retrofitted into the String class.

The chapter starts with an examination of the Pattern class. I cover every method and field and often provide examples. I also explore the Matcher class in similar detail. Finally, I round out the chapter by discussing the regex methods that have been added to the String class.

The Pattern Object

Figure 2-1 shows the methods of the Pattern class. The figure is a UML rendering of the Pattern class that illustrates the various methods and constants of the class.


Figure 2-1.  A UML representation of the Pattern class 

Let's examine the fields and methods of the Pattern class in detail. If you aren't familiar with UML, here's a quick guide to reading Figure 2-1:

  • The name of the class is Pattern, and it's in the topmost section of the rectangle.

  • The middle section is a grouping of the field variables. The plus sign (+) preceding these indicates that they're public. The underline indicates that they're static, and the ": int" indicates that they're of type int. The "= num" indicates the default value.

  • The bottommost section of the rectangle holds the class's methods. Again, the plus sign indicates public access, and the underline indicates that the method is static. The parentheses designate the parameters for a given method; thus, flags() takes no parameters, whereas matcher(input : CharSequence) takes a parameter named input of type CharSequence. The colon (:) toward the end indicates the return type.

The following sections describe the fields and methods of the Pattern class.

public static final int UNIX_LINES

The UNIX_LINES flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. Use this flag when you parse data that originates on a UNIX machine.

On UNIX systems, the invisible character \n marks the end of a line. Other platforms use different conventions: Windows, for example, uses \r\n, and by default the regex engine also treats \r, \u0085, \u2028, and \u2029 as line terminators.

If you transport a file that originated on a UNIX machine to a Windows platform and open it, you may notice that the lines will sometimes not terminate in the expected manner, depending on which editor you use to view the text. This happens because the two systems can use different syntax to denote the end of a line.

The UNIX_LINES flag simply tells the regex engine that it's dealing with UNIX-style lines, so only \n is recognized as a line terminator. This affects the matching behavior of the regular expression metacharacters ., ^, and $. Using the UNIX_LINES flag, or the equivalent embedded flag (?d), doesn't degrade performance. By default, this flag isn't set.
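As a minimal sketch of the difference, the following counts how often ^ matches in a String that contains both a lone \r and a \n. The class name UnixLinesDemo and the helper countMatches are hypothetical names chosen for illustration, and MULTILINE is combined with UNIX_LINES here only so that ^ can match at every line start.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Hypothetical helper: counts how many times regex matches in input under the given flags.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A lone \r separates the first two lines; \n separates the last two.
        String input = "one\rtwo\nthree";

        // Default line handling: ^ matches at offset 0, after \r, and after \n.
        System.out.println(countMatches("^", Pattern.MULTILINE, input));

        // UNIX_LINES: only \n ends a line, so ^ no longer matches after \r.
        System.out.println(countMatches("^", Pattern.MULTILINE | Pattern.UNIX_LINES, input));

        // The embedded (?d) flag has the same effect as Pattern.UNIX_LINES.
        System.out.println(countMatches("(?d)^", Pattern.MULTILINE, input));
    }
}
```

With this input, the first count should be one higher than the other two, because the position after \r stops counting as a line start once UNIX_LINES is set.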

public static final int CASE_INSENSITIVE

The CASE_INSENSITIVE field is used when constructing the second parameter of the Pattern.compile(String regex, int flags) method. It's useful when you need to match ASCII characters, regardless of case.

Using this flag or the equivalent embedded flag (?i) can cause performance to degrade slightly. By default, this flag isn't set.
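A short sketch of the flag in use follows. The class CaseInsensitiveDemo and its matches helper are hypothetical names for illustration; only Pattern.compile, Pattern.CASE_INSENSITIVE, and Matcher.matches come from the library.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical helper: tests whether the whole input matches regex,
    // optionally with the CASE_INSENSITIVE flag set.
    static boolean matches(String regex, String input, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without the flag, case must agree exactly.
        System.out.println(matches("java", "JaVa", false));

        // With the flag, ASCII letters match regardless of case.
        System.out.println(matches("java", "JaVa", true));

        // The embedded (?i) flag switches on the same behavior from inside the pattern.
        System.out.println(matches("(?i)java", "JAVA", false));
    }
}
```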

public static final int COMMENTS

The COMMENTS flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the regex pattern has embedded comments in it. Specifically, it tells the regex engine to ignore whitespace in the pattern, along with everything from a # character until the end of the line.

Thus, the regex pattern A   #matches uppercase US-ASCII char code 65 will use A as the regular expression, but the spaces leading to the # character and everything after it until the end of the line will be ignored. Your code might end up looking something like this:

Pattern p = Pattern.compile(
    "A   #matches uppercase US-ASCII char code 65", Pattern.COMMENTS);

Think of the # character as the regex equivalent of the Java comment //. By using the COMMENTS flag in compiling your regex, you're telling the regex engine that your expression contains comments, which should be ignored. This can be useful if your pattern is particularly complex or subtle. When you don't set this flag, the regex engine will attempt to interpret and use your comments and whitespace as part of the regular expression.

Using this flag or the equivalent embedded flag (?x) doesn't degrade performance.
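The flag is most useful when a pattern is spread over several lines, each with its own comment. The following is a sketch only: the class CommentsDemo, the constant PHONE_REGEX, and the seven-digit phone number format are all assumptions made up for illustration.

```java
import java.util.regex.Pattern;

class CommentsDemo {
    // Hypothetical pattern for a phone number such as 555-1234.
    // Each line ends with \n so that every # comment stops at its own line end.
    static final String PHONE_REGEX =
        "\\d{3}   # exchange: three digits\n" +
        "-        # literal hyphen\n" +
        "\\d{4}   # line number: four digits";

    // Hypothetical helper: matches input against PHONE_REGEX with or without COMMENTS.
    static boolean matchesPhone(String input, boolean comments) {
        int flags = comments ? Pattern.COMMENTS : 0;
        return Pattern.compile(PHONE_REGEX, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // With COMMENTS, whitespace and # comments are stripped from the pattern.
        System.out.println(matchesPhone("555-1234", true));

        // Without COMMENTS, the spaces and comment text are treated as literal
        // characters to match, so the same input fails.
        System.out.println(matchesPhone("555-1234", false));
    }
}
```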

public static final int MULTILINE

The MULTILINE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the input isn't a single line of text; rather, it contains several lines, each with its own termination characters.

This means that the beginning-of-line metacharacter, ^, and the end-of-line metacharacter, $, can match at the start and end of each line within the input String, not just at the start and end of the whole input. For example, imagine that your input String is This is a sentence.\nSo is this. If you use the MULTILINE flag to compile the regular expression pattern

Pattern p = Pattern.compile("^", Pattern.MULTILINE);

then the beginning-of-line metacharacter, ^, will match just before the T in This is a sentence. It will also match just before the S in So is this. When you don't use the MULTILINE flag, ^ will match only once, just before the T in This is a sentence.

Using this flag or the equivalent (?m) regular expression may degrade performance.
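To see where ^ actually matches, the sketch below collects the offset of every match in a two-line input. The class MultilineDemo and the helper lineStarts are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Hypothetical helper: returns the offsets at which ^ matches in input.
    static List<Integer> lineStarts(String input, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile("^", flags).matcher(input);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "This is a sentence.\nSo is this.";

        // With MULTILINE: one offset per line (before the T and before the S).
        System.out.println(lineStarts(input, true));

        // Without MULTILINE: only the start of the whole input.
        System.out.println(lineStarts(input, false));
    }
}
```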

public static final int DOTALL

The DOTALL flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. The DOTALL flag tells the regex engine to allow the metacharacter period to match any character, including line termination characters. What does this mean?

Imagine that your candidate String is Test\n. If your corresponding regex pattern is . (a single period), you would normally have four matches: one for the T, another for the e, another for the s, and the fourth for the t. This is because the regex metacharacter . will normally match any character except a line termination character.

Enabling the DOTALL flag as follows:

Pattern p = Pattern.compile(".", Pattern.DOTALL);

would generate five matches. Your pattern would match the T, e, s, and t characters. In addition, it would match the \n character at the end of the line.

Using this flag or the equivalent embedded flag (?s) doesn't degrade performance.
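The four-versus-five count described above can be checked with a small sketch. The class DotAllDemo and the helper countDotMatches are hypothetical names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: counts how many single characters the pattern . matches.
    static int countDotMatches(String input, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        Matcher m = Pattern.compile(".", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "Test\n";

        // Default: . skips the trailing \n, so only T, e, s, and t match.
        System.out.println(countDotMatches(input, false)); // prints 4

        // DOTALL: the trailing \n matches as well.
        System.out.println(countDotMatches(input, true)); // prints 5
    }
}
```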

public static final int UNICODE_CASE

The UNICODE_CASE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It is used in conjunction with the CASE_INSENSITIVE flag to generate case-insensitive matches for characters outside the ASCII range, such as those in international character sets.

Using this flag or the equivalent (?u) regular expression can degrade performance.
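The following sketch shows why the two flags are combined. It matches a lowercase e with an acute accent (\u00e9) against its uppercase form (\u00c9); the class UnicodeCaseDemo and the helper matchesAccentedE are hypothetical names, and the accented letter is just one convenient non-ASCII example.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: matches the pattern \u00e9 (lowercase e-acute)
    // against input using the given flags.
    static boolean matchesAccentedE(String input, int flags) {
        return Pattern.compile("\u00e9", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String upper = "\u00c9"; // uppercase E-acute

        // CASE_INSENSITIVE alone only folds ASCII letters, so this fails.
        System.out.println(matchesAccentedE(upper, Pattern.CASE_INSENSITIVE));

        // Adding UNICODE_CASE extends case folding to non-ASCII letters.
        System.out.println(matchesAccentedE(upper,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```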

public static final int CANON_EQ

The CANON_EQ flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. As you know, characters are actually stored as numbers. For example, in the ASCII character set, the character A is represented by the number 65. In Unicode, the same character can sometimes be represented by more than one numeric sequence. For example, à can be represented by both U+00E0 and the sequence U+0061 U+0300 (the letter a followed by a combining grave accent). A CANON_EQ match would match either representation.

Using this flag may degrade performance.
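As a rough sketch, the pattern below is written in the decomposed form (a followed by a combining grave accent) and tested against the single precomposed character. The class CanonEqDemo and the helper matchesDecomposedA are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

class CanonEqDemo {
    // Hypothetical helper: the pattern is the decomposed form of a-grave,
    // that is, U+0061 followed by the combining grave accent U+0300.
    static boolean matchesDecomposedA(String input, boolean canonEq) {
        int flags = canonEq ? Pattern.CANON_EQ : 0;
        return Pattern.compile("a\u0300", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00e0"; // a-grave as a single code point

        // Without CANON_EQ, the two representations are compared code point
        // by code point, so they do not match.
        System.out.println(matchesDecomposedA(precomposed, false));

        // With CANON_EQ, canonically equivalent sequences are treated as equal.
        System.out.println(matchesDecomposedA(precomposed, true));
    }
}
```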

public static Pattern compile(String regex) Throws a PatternSyntaxException

You'll notice that the Pattern class doesn't have a public constructor. This means that you can't write the following type of code:

Pattern p = new Pattern("my regex");//wrong!

To get a reference to a Pattern object, you must use the static method compile(String regex). Thus, your first line of regex code might look like the following:

Pattern p = Pattern.compile("my regex");//Right!

The parameter for this method is a String that represents a regular expression. When you pass a String literal to a method that expects a regular expression, it's important to escape any \ characters that the regular expression contains by placing another \ character in front of each one. This is because Java String literals themselves use the \ character to introduce escape sequences such as \n, regardless of whether the String holds a regular expression. This was true long before regular expressions were part of Java. Thus, the regular expression \d becomes \\d. To match a single digit, your code becomes the following:

Pattern p = Pattern.compile("\\d");

The point is that the regular expression \d is written as the String literal "\\d".

Escaping the String parameter can sometimes be tricky, so it's important to understand it well. By and large, it means that you double every \ character that is already present in the regular expression. It does not mean that you simply add a single \ character to the whole expression. I present an example to illustrate this shortly.
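In the meantime, here is a minimal sketch of the doubling rule. The class EscapeDemo and its two helpers are hypothetical names for illustration; the second helper covers the case where the regular expression itself needs to match a literal backslash, which takes four \ characters in source code.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex \d is written as the String literal "\\d".
    static boolean isSingleDigit(String input) {
        return Pattern.compile("\\d").matcher(input).matches();
    }

    // The regex \\ (one literal backslash) is written as the String literal "\\\\".
    static boolean isSingleBackslash(String input) {
        return Pattern.compile("\\\\").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));      // prints true
        System.out.println(isSingleDigit("d"));      // prints false

        // The String literal "\\" holds exactly one backslash character.
        System.out.println(isSingleBackslash("\\")); // prints true
    }
}
```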

The compile method will throw a java.util.regex.PatternSyntaxException if the regular expression itself is badly formed. For example, if you pass in a String that contains [4, the compile method will throw a PatternSyntaxException at runtime, because the syntax of the regular expression [4 is illegal, as shown in Listing 2-1.

Listing 2-1. Using the compile Method

import java.util.regex.*;

public class DelimitTest {
  public static void main(String[] args) {
    // Throws a PatternSyntaxException: the character class [4 is never closed
    Pattern p = Pattern.compile("[4");
  }
}

Does this mean that you have to catch a PatternSyntaxException every time you use a regular expression? No. PatternSyntaxException doesn't have to be explicitly caught, because it is an unchecked exception: it extends IllegalArgumentException, which in turn extends RuntimeException, and a RuntimeException doesn't need to be explicitly caught or declared.
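Catching the exception is still worthwhile when the regular expression comes from outside your code, such as from user input. The following is one possible sketch; the class SafeCompileDemo and the helper compileOrNull are hypothetical names, and returning null on failure is just one design choice among several.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SafeCompileDemo {
    // Hypothetical helper: returns the compiled Pattern, or null if regex is malformed.
    static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            // The exception reports what went wrong, where, and in which pattern.
            System.err.println("Bad regex \"" + e.getPattern() + "\": "
                + e.getDescription() + " near index " + e.getIndex());
            return null;
        }
    }

    public static void main(String[] args) {
        // [4 is an unclosed character class, so compilation fails.
        System.out.println(compileOrNull("[4") == null);  // prints true

        // [4] is well formed.
        System.out.println(compileOrNull("[4]") != null); // prints true
    }
}
```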

The compile(String regex) method returns a Pattern object.

