Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
Regular Expressions - Character Classes
There are times when you need to describe your search criteria as a class—that is, as a group that shares potentially complex commonalities that you need to be able to describe and for which there are no predefined classes. Fortunately, regex provides a mechanism for doing so through character classes, as shown in Table 1-11.
Table 1-11. Character Classes
Pattern          Description
[abc]            a, b, or c. (Of course, any character could be used, not just a, b, or c.)
[^abc]           Any character except a, b, or c.
[a-zA-Z]         a through z or A through Z.
[a-d[m-p]]       a through d, or m through p: [a-dm-p].
[a-z&&[def]]     Whatever exists in both sets, namely d, e, or f.
[a-z&&[^bc]]     a through z, except for b and c: [ad-z].
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z].
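The rows of Table 1-11 can be checked directly from Java. The following is a minimal sketch, not code from the book: the class name CharacterClassDemo and the helper matchesClass are made up for illustration, and only the standard java.util.regex.Pattern.matches call is assumed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class CharacterClassDemo {
    // Returns true when the whole input is matched by the given class pattern.
    static boolean matchesClass(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("[abc]", "b"));         // true: b is listed
        System.out.println(matchesClass("[^abc]", "b"));        // false: b is excluded
        System.out.println(matchesClass("[a-d[m-p]]", "n"));    // true: union of two ranges
        System.out.println(matchesClass("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(matchesClass("[a-z&&[^bc]]", "b"));  // false: b is subtracted
        System.out.println(matchesClass("[a-z&&[^m-p]]", "q")); // true: q is outside m-p
    }
}
```

Because Pattern.matches requires the entire input to match, each call here tests exactly one character against the class.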
There are also some predefined Portable Operating System Interface for UNIX (POSIX) character classes. These are American Standard Code for Information Interchange (ASCII) classes that experience has shown to be particularly useful. Thus, they’re already in place, and you can simply refer to them for use. Table 1-12 contains the POSIX character classes.
Table 1-12. POSIX Character Classes
Pattern       Description                       Equivalent
\p{Lower}     A lowercase letter                [a-z]
\p{Upper}     An uppercase letter               [A-Z]
\p{ASCII}     All ASCII characters              [\x00-\x7F]
\p{Alpha}     An upper- or lowercase letter     [\p{Lower}\p{Upper}]
\p{Digit}     A digit                           [0-9]
\p{Alnum}     A number or a letter              [\p{Alpha}\p{Digit}]
\p{Punct}     Punctuation                       One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}     Any visible character             [\p{Alnum}\p{Punct}]
\p{Print}     A printable character             [\p{Graph}\x20]
\p{Blank}     A tab or space                    [ \t]
\p{Cntrl}     A control character               [\x00-\x1F\x7F]
\p{XDigit}    A hexadecimal digit               [0-9a-fA-F]
\p{Space}     A whitespace character            [ \t\n\x0B\f\r]
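To see a few of these predefined classes at work, here is a small sketch. The class PosixClassDemo and its helper methods are hypothetical names chosen for this illustration. Note that inside a Java string literal the backslash must be doubled, so \p{XDigit} is written as "\\p{XDigit}".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class PosixClassDemo {
    // True when the input consists solely of hexadecimal digits.
    static boolean isHex(String s) {
        return Pattern.matches("\\p{XDigit}+", s);
    }

    // Removes every ASCII punctuation character from the input.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // True when the input consists solely of ASCII letters.
    static boolean isAsciiLetters(String s) {
        return Pattern.matches("\\p{Alpha}+", s);
    }

    public static void main(String[] args) {
        System.out.println(isHex("CAFE42"));                // true
        System.out.println(isHex("CAGE"));                  // false: G is not a hex digit
        System.out.println(stripPunct("Hello, world!"));    // Hello world
        // These classes are ASCII-only by default, so an accented letter fails.
        System.out.println(isAsciiLetters("\u00e9t\u00e9")); // false
    }
}
```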
Simple Class Example
Let’s step through some simple examples. The pattern [0-5] matches any single character in the input that is a digit from 0 through 5. Thus, it will match on 0, 1, 2, 3, 4, or 5. It won’t match 6, 8, or any nondigit characters. Table 1-13 dissects the pattern.
Table 1-13. The Pattern [0-5]
Regex    Description
[        A class consisting of
0        The digit 0
-        Ranging through
5        The digit 5
]        Close class
* In English: Look for any digit ranging from 0 to 5, including 0 and 5.
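A short sketch of the [0-5] pattern in use follows. The class SimpleClassDemo, the helper findAll, and the sample input are assumptions made for this illustration. The helper collects each match that Matcher.find reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample input are illustrative only.
class SimpleClassDemo {
    // Collects every substring of the input that the regex matches, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // 8 and 6 fall outside the range, as do the letters, comma, and spaces.
        System.out.println(findAll("[0-5]", "Room 4086, floor 5")); // [4, 0, 5]
    }
}
```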
Negation Example
The pattern [^A] will match any character except the character A. This includes other letters (the lowercase a among them), spaces, tabs, punctuation, and so on. It’s important to notice that the ^ metacharacter only means "not" when it is the first character inside a class bracket, that is, immediately after the [. Outside those brackets, it anchors the match to the beginning of the line. I cover this topic in more detail later. Table 1-14 dissects the pattern.
Table 1-14. The Pattern [^A]
Regex    Description
[        A class consisting of
^        Any character except
A        The character A
]        Close class
* In English: Look for any character except the capital letter A.
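The two meanings of ^ can be contrasted in a few lines. This is only a sketch: the class NegationDemo, its two helpers, and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample input are illustrative only.
class NegationDemo {
    // Inside brackets, ^ negates: delete every character that is not a capital A.
    static String keepOnlyCapitalA(String s) {
        return s.replaceAll("[^A]", "");
    }

    // Outside brackets, ^ anchors: true only when the input begins with A.
    static boolean startsWithA(String s) {
        return Pattern.compile("^A").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(keepOnlyCapitalA("A BANANA, a day")); // AAAA
        System.out.println(startsWithA("Apple"));                // true
        System.out.println(startsWithA("BA"));                   // false
    }
}
```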
Groups and Back References
Groups are simply logical divisions of the text. When you describe a group in regex, you’re providing a mechanism for the JVM to treat characters that fall into that group in a specific way.
Back references allow the regex pattern to refer to the text a group has already matched, even while the match is still in progress. A pattern can refer to the most recent group it captured, or the one before that, or any earlier group, by that group’s number.
In the sections that follow, I cover the topics of groups and back references in more detail and present an example for each.
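As a quick preview, here is a minimal sketch of a back reference. In the pattern \b(\w+) \1\b, the \1 stands for whatever text group 1 just captured, so the pattern finds a word that is immediately repeated. The class BackReferenceDemo, the helper firstRepeatedWord, and the sample sentence are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample input are illustrative only.
class BackReferenceDemo {
    // (\w+) captures a word; \1 requires that same text again after one space.
    private static final Pattern DOUBLED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns the first word that appears twice in a row, or null if none does.
    static String firstRepeatedWord(String s) {
        Matcher m = DOUBLED.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("Paris in the the spring")); // the
        System.out.println(firstRepeatedWord("no repeats here"));         // null
    }
}
```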
Groups
A group is a submatch. If you’re familiar with SQL, it might be helpful to think of a group as the regex equivalent of a SQL subquery. Groups allow you to define parts of your pattern as logical subunits of the whole and then refer to the results of those subunits. Their syntax follows in Table 1-15.
Table 1-15. Groups
Regex    Description
(        A group consisting of
…        Any regex pattern
)        Close group
Groups Example
As with most things, an example can be more illuminating than a description. Consider the pattern (\w+)_(\w+)@(\w+)\.org, which matches one style of e-mail address. Table 1-16 dissects this pattern.
Table 1-16. The Pattern (\w+)_(\w+)@(\w+)\.org
Regex    Description
(        A group consisting of
\w       An alphanumeric or underscore character
+        Repeated one or more times
)        Close group
_        Followed by an underscore character
(        A group consisting of
\w       An alphanumeric or underscore character
+        Repeated one or more times
)        Close group
@        Followed by an at character
(        A group consisting of
\w       An alphanumeric or underscore character
+        Repeated one or more times
)        Close group
\.       Followed by the period character
o        Followed by the character o
r        Followed by the character r
g        Followed by the character g
* In English: Look for a group of alphanumeric or underscore characters, followed by _, followed by a group of alphanumeric or underscore characters, followed by @, followed by a group of alphanumeric or underscore characters, followed by .org.
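The pattern in Table 1-16 can be tried out with a short sketch like the one below. The class EmailGroupsDemo, the helper parts, and the sample addresses are hypothetical; the only library calls assumed are Pattern.compile, Matcher.matches, and Matcher.group. Groups are numbered from 1, in the order their opening parentheses appear.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample addresses are illustrative only.
class EmailGroupsDemo {
    private static final Pattern EMAIL =
            Pattern.compile("(\\w+)_(\\w+)@(\\w+)\\.org");

    // Returns the three captured groups, or null when the address doesn't match.
    static String[] parts(String address) {
        Matcher m = EMAIL.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // prints [john, smith, example]
        System.out.println(Arrays.toString(parts("john_smith@example.org")));
        // No underscore before the @, so the pattern doesn't match: prints null
        System.out.println(Arrays.toString(parts("john.smith@example.org")));
    }
}
```

Because \w also matches the underscore and + is greedy, an address with two underscores, such as a_b_c@x.org, puts a_b in the first group and c in the second.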