Introduction to the Java.util.regex Object Model
If you have ever wanted to know all about the Pattern and Matcher classes of Java's new java.util.regex package, this article is an excellent place to start. It is taken from chapter 2 of the book
Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
“How do you eat an elephant? One bite at a time.”
— Proverb
Java’s new java.util.regex package offers an elegant and agile object model with which to meet regular expression needs. It is composed, in its entirety, of three objects: the Pattern object, the Matcher object, and a PatternSyntaxException. This chapter details all of the methods and fields of the Pattern and Matcher classes, and provides examples of their use. In this chapter I also discuss the new regular expression support methods retrofitted into the String class.
The chapter starts with an examination of the Pattern class. I cover every method and field and often provide examples. I also explore the Matcher class in similar detail. Finally, I round out the chapter by discussing the regex methods that have been added to the String class.
The Pattern Object
Figure 2-1 is a UML rendering of the Pattern class that shows its methods and constants.

Figure 2-1. A UML representation of the Pattern class
Let’s examine the fields and methods of the Pattern class in detail. If you aren’t familiar with UML, here’s a quick guide to reading Figure 2-1:
- The name of the class is Pattern, and it’s in the topmost section of the rectangle.
- The middle section is a grouping of the field variables. The plus sign (+) preceding these indicates that they’re public. The underline indicates that they’re static, and the “: int” indicates that they’re of type int. The “= num” indicates the default value.
- The bottommost section of the rectangle holds the class’s methods. Again, the plus sign indicates public access, and the underline indicates that the method is static. The parentheses designate the parameters for a given method; thus, flags() takes no parameters, whereas matcher(input : CharSequence) takes a parameter named input of type CharSequence. The colon (:) toward the end indicates the return type.
The following sections describe the fields and methods of the Pattern class.
public static final int UNIX_LINES
The UNIX_LINES flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. Use this flag when you parse data that originates on a UNIX machine.
On many flavors of UNIX, the invisible character \n is used to note termination of a line. This is distinct from other operating systems, including flavors of Windows, which may use \r\n, \n, \r, \u2028, or \u0085 for a line terminator.
If you transport a file that originated on a UNIX machine to a Windows platform and open it, you may notice that the lines will sometimes not terminate in the expected manner, depending on which editor you use to view the text. This happens because the two systems can use different syntax to denote the end of a line.
The UNIX_LINES flag simply tells the regex engine that it’s dealing with UNIX-style lines, so that only \n is treated as a line terminator. This affects the matching behavior of the regular expression metacharacters ., ^, and $. Using the UNIX_LINES flag, or the equivalent (?d) regex pattern, doesn’t degrade performance. By default, this flag isn’t set.
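To make the effect concrete, here is a minimal sketch that compares the default mode with UNIX_LINES. The class name UnixLinesDemo, the helper method, and the sample input are illustrative assumptions, not part of the book’s listings. It relies on the documented behavior that, in UNIX_LINES mode, only \n counts as a line terminator, so the period is free to match a carriage return.

```java
import java.util.regex.Pattern;

public class UnixLinesDemo {
    // Returns true if the whole input matches the regex a.b under the given flags.
    static boolean matches(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\rb"; // a carriage return sits between a and b

        // Default mode: \r is a line terminator, so the period cannot match it.
        System.out.println(matches(input, 0));                  // false

        // UNIX_LINES: only \n is a line terminator, so the period matches \r.
        System.out.println(matches(input, Pattern.UNIX_LINES)); // true
    }
}
```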
public static final int CASE_INSENSITIVE
The CASE_INSENSITIVE field is used when constructing the second parameter of the Pattern.compile(String regex, int flags) method. It’s useful when you need to match ASCII characters, regardless of case.
Using this flag or the equivalent (?i) regular expression can cause performance to degrade slightly. By default, this flag isn’t set.
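The following sketch shows the flag and its embedded (?i) equivalent side by side. The class name CaseInsensitiveDemo, the helper methods, and the sample strings are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

public class CaseInsensitiveDemo {
    // Case-insensitive match requested through the flags parameter.
    static boolean viaFlag(String input) {
        return Pattern.compile("java", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // The same request expressed inside the regex itself.
    static boolean viaEmbeddedFlag(String input) {
        return Pattern.compile("(?i)java").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaFlag("JaVa"));          // true
        System.out.println(viaEmbeddedFlag("JAVA"));  // true

        // Without the flag, case matters.
        System.out.println(Pattern.compile("java").matcher("JAVA").matches()); // false
    }
}
```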
public static final int COMMENTS
The COMMENTS flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the regex pattern has an embedded comment in it. Specifically, it tells the regex engine to ignore any comments in the pattern, starting with the spaces leading up to the # character and everything thereafter, until the end of the line.
Thus, the regex pattern A #matches uppercase US-ASCII char code 65 will use A as the regular expression, but the spaces leading to the # character and everything after it until the end of the line will be ignored. Your code might end up looking something like this:
Pattern p = Pattern.compile(
    "A #matches uppercase US-ASCII char code 65", Pattern.COMMENTS);
Think of the # character as the regex equivalent of the Java comment //. By using the COMMENTS flag in compiling your regex, you’re telling the regex engine that your expression contains comments, which should be ignored. This can be useful if your pattern is particularly complex or subtle. When you don’t set this flag, the regex engine will attempt to interpret and use your comments as part of the regular expression.
Using this flag or the equivalent (?x) regular expression doesn’t degrade performance.
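Comments pay off most when a pattern is spread over several lines. The sketch below is one way to do that; the class name CommentsDemo, the phone-number pattern, and the sample input are hypothetical and chosen only for illustration. Each comment runs to the end of its line, so the pieces are joined with \n.

```java
import java.util.regex.Pattern;

public class CommentsDemo {
    // A hypothetical pattern, documented piece by piece.
    static final String REGEX =
          "\\d{3}  # three leading digits\n"
        + "-       # a literal hyphen\n"
        + "\\d{4}  # four trailing digits";

    // Returns true if the whole input matches REGEX under the given flags.
    static boolean matches(String input, int flags) {
        return Pattern.compile(REGEX, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // With COMMENTS, whitespace and # comments in the pattern are ignored.
        System.out.println(matches("555-1234", Pattern.COMMENTS)); // true

        // Without it, the spaces and comment text are treated as part of the regex.
        System.out.println(matches("555-1234", 0));                // false
    }
}
```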
public static final int MULTILINE
The MULTILINE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It tells the regex engine that the input isn’t a single line of text; rather, it contains several lines that have their own termination characters.
This means that the beginning-of-line character, ^, and the end-of-line character, $, will potentially match at several lines within the input String. For example, imagine that your input String is This is a sentence.\nSo is this. If you use the MULTILINE flag to compile the regular expression pattern
Pattern p = Pattern.compile("^", Pattern.MULTILINE);
then the beginning-of-line character, ^, will match before the T in This is a sentence. It will also match just before the S in So is this. When you don’t use the MULTILINE flag, ^ will match only before the T in This is a sentence.
Using this flag or the equivalent (?m) regular expression may degrade performance.
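The two-line example above can be checked by counting how many times ^ matches. This is a minimal sketch; the class name MultilineDemo and the helper countLineStarts are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // Counts the positions at which ^ matches in the input under the given flags.
    static int countLineStarts(String input, int flags) {
        Matcher m = Pattern.compile("^", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "This is a sentence.\nSo is this.";

        // Default mode: ^ matches only at the very start of the input.
        System.out.println(countLineStarts(input, 0));                 // 1

        // MULTILINE: ^ also matches just after the line terminator.
        System.out.println(countLineStarts(input, Pattern.MULTILINE)); // 2
    }
}
```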
public static final int DOTALL
The DOTALL flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. The DOTALL flag tells the regex engine to allow the metacharacter period to match any character, including line termination characters. What does this mean?
Imagine that your candidate String is Test\n. If your corresponding regex pattern is . you would normally have four matches: one for the T, another for the e, another for s, and the fourth for t. This is because the regex metacharacter . will normally match any character except a line termination character.
Enabling the DOTALL flag as follows:
Pattern p = Pattern.compile(".", Pattern.DOTALL);
would generate five matches. Your pattern would match the T, e, s, and t characters. In addition, it would match the \n character at the end of the line.
Using this flag or the equivalent (?s) regular expression doesn’t degrade performance.
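The four-versus-five count described above can be reproduced with a short sketch. The class name DotallDemo and the helper countMatches are assumptions introduced for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Counts how many single characters the period matches under the given flags.
    static int countMatches(String input, int flags) {
        Matcher m = Pattern.compile(".", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "Test\n";

        // Default mode: T, e, s, and t match; the trailing \n does not.
        System.out.println(countMatches(input, 0));              // 4

        // DOTALL: the trailing \n matches as well.
        System.out.println(countMatches(input, Pattern.DOTALL)); // 5
    }
}
```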
public static final int UNICODE_CASE
The UNICODE_CASE flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. It is used in conjunction with the CASE_INSENSITIVE flag to generate case-insensitive matches for the international character sets.
Using this flag or the equivalent (?u) regular expression can degrade performance.
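Because UNICODE_CASE only has an effect alongside CASE_INSENSITIVE, the two flags are combined with the bitwise OR operator. The sketch below uses an accented letter to show the difference; the class name UnicodeCaseDemo, the helper method, and the choice of characters are illustrative assumptions. Unicode escapes are used so that the example doesn’t depend on the source file’s encoding.

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    static final String LOWER_E_ACUTE = "\u00e9"; // lowercase e with acute accent
    static final String UPPER_E_ACUTE = "\u00c9"; // uppercase E with acute accent

    // Returns true if the uppercase letter matches the lowercase pattern under the given flags.
    static boolean matches(int flags) {
        return Pattern.compile(LOWER_E_ACUTE, flags).matcher(UPPER_E_ACUTE).matches();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds case for US-ASCII characters.
        System.out.println(matches(Pattern.CASE_INSENSITIVE));                        // false

        // Adding UNICODE_CASE extends case folding to non-ASCII characters.
        System.out.println(matches(Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)); // true
    }
}
```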
public static final int CANON_EQ
The CANON_EQ flag is used in constructing the second parameter of the Pattern.compile(String regex, int flags) method. As you know, characters are actually stored as numbers. For example, in the ASCII character set, the character A is represented by the number 65. Depending on the character set that you’re using, the same character can be represented by different numeric combinations. For example, à can be represented by both U+00E0 and the sequence U+0061 U+0300. A CANON_EQ match would match either representation.
Using this flag may degrade performance.
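As a rough sketch of canonical equivalence, the example below writes the pattern as the two-code-point form of à and tests it against the single-code-point form. The class name CanonEqDemo and the helper method are assumptions for illustration; the behavior shown mirrors the kind of example given in the Pattern class documentation for this flag.

```java
import java.util.regex.Pattern;

public class CanonEqDemo {
    static final String DECOMPOSED  = "a\u0300"; // a followed by a combining grave accent
    static final String PRECOMPOSED = "\u00e0";  // a with grave accent as one code point

    // Returns true if the precomposed input matches the decomposed pattern under the given flags.
    static boolean matches(int flags) {
        return Pattern.compile(DECOMPOSED, flags).matcher(PRECOMPOSED).matches();
    }

    public static void main(String[] args) {
        // Without CANON_EQ the two representations are compared code point by code point.
        System.out.println(matches(0));                // false

        // With CANON_EQ, canonically equivalent representations are expected to match.
        System.out.println(matches(Pattern.CANON_EQ)); // true
    }
}
```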
public static Pattern compile(String regex) Throws a PatternSyntaxException
You’ll notice that the Pattern class doesn’t have a public constructor. This means that you can’t write the following type of code:
Pattern p = new Pattern("my regex");//wrong!
To get a reference to a Pattern object, you must use the static method compile(String regex). Thus, your first line of regex code might look like the following:
Pattern p = Pattern.compile("my regex");//Right!
The parameter for this method is a String that represents a regular expression. When you pass a String literal to a method that expects a regular expression, it’s important to delimit any \ characters that the regular expression contains by placing another \ character in front of each one. This is because Java string literals already use the \ character to delimit special characters such as \n, regardless of whether the string is a regular expression. This was true long before regular expressions were part of Java. Thus, the regular expression \d becomes \\d. To match a single digit, your code becomes the following:
Pattern p = Pattern.compile("\\d");
The point here being that the regular expression \d becomes the String \\d.
The delimitation of the String parameter can sometimes be tricky, so it’s important to understand it well. By and large, it means that you double the \ characters that might already be present in the regular expression. It does not mean that you simply append a single \ character. I present an example to illustrate this shortly.
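As a quick illustration of the doubling rule, the sketch below compiles the regular expression \d and a regular expression that matches one literal backslash. The class name EscapeDemo and the helper method are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Returns true if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The Java literal "\\d" is the two-character string \d.
        System.out.println("\\d".length());          // 2
        System.out.println(matches("\\d", "7"));     // true

        // A regex that matches one backslash is \\ which, once doubled
        // again for the Java literal, is written with four backslashes.
        System.out.println(matches("\\\\", "\\"));   // true
    }
}
```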
The compile method will throw a java.util.regex.PatternSyntaxException if the regular expression itself is badly formed. For example, if you pass in a String that contains [4, the compile method will throw a PatternSyntaxException at runtime, because the syntax of the regular expression [4 is illegal, as shown in Listing 2-1.
Listing 2-1. Using the compile Method
import java.util.regex.*;

public class DelimitTest {
    public static void main(String[] args) {
        //throws exception
        Pattern p = Pattern.compile("[4");
    }
}
Does this mean that you have to catch a PatternSyntaxException every time you use a regular expression? No. PatternSyntaxException doesn’t have to be explicitly caught, because it extends IllegalArgumentException, which in turn extends RuntimeException, and a RuntimeException doesn’t need to be explicitly caught.
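Although catching the exception is optional, doing so can be worthwhile when the regular expression comes from user input or a configuration file. The sketch below is one possible approach; the class name SyntaxCheckDemo, the helper describe, and the message format are assumptions for illustration. It uses the getPattern(), getIndex(), and getDescription() accessors of PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SyntaxCheckDemo {
    // Returns "ok" for a well-formed regex, or a short report of what went wrong.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException e) {
            // getIndex() is the approximate position of the error, or -1 if unknown.
            return "bad pattern '" + e.getPattern() + "' at index " + e.getIndex()
                    + ": " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("\\d+")); // ok
        System.out.println(describe("[4"));   // a report beginning with: bad pattern '[4'
    }
}
```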
The compile(String regex) method returns a Pattern object.