Regular Expressions - Common and Boundary Characters
Regular expressions also contain characters that take on special meaning when they’re delimited by the \ character. These facilitate finding common tokens, such as word boundaries, empty spaces, tabs, alphanumeric characters, and so on. For example, \n and \t are special characters that represent a newline and a tab, respectively.
In this section, I cover these common boundary characters and provide examples of their use.
Common Characters
Certain types of characters occur often enough that regular expression languages have developed a shorthand for referring to them. For example, a digit is designated by the \d expression. Without the \ character delimiting the d, the expression would simply refer to the fourth letter of the English alphabet, in lowercase. Table 1-3 lists some of these common characters.
Table 1-3. Common and Boundary Characters
Character | Description |
. | Matches any character; may also match line terminators. |
\d | A digit [0-9]. This will match any single digit from 0 to 9. Notice that an input of 19 will need to match twice: once for the 1 and once again for the 9. |
\D | A nondigit [^0-9]. This will match any character that isn’t a digit, including a whitespace character. |
\w | A word character [a-zA-Z_0-9]. This will match any single character from a to z or A to Z, an underscore, or any single digit from 0 to 9. |
\W | A nonword character [^\w]. This will match any character that isn’t a word character, such as a punctuation symbol, including whitespace characters. |
\t | The tab character. |
\n | The newline (linefeed) character. |
\r | The carriage-return character. |
\f | The form-feed character. |
\s | A whitespace character. This includes the space, tab, newline, carriage-return, form-feed, and vertical-tab characters. |
\S | A non-whitespace character, also known as [^\s]. This will match any character that isn’t a whitespace character, as described previously. |
^ | The beginning of a line. |
$ | The end of a line. |
\b | A word boundary. A word boundary is the zero-width position immediately before or after what we think of as "words" in English vernacular, that is, a run of \w characters as described previously. It doesn’t consume a character itself; most often, it falls next to a space or a tab, or at the beginning or end of a line. |
\B | A non–word boundary. |
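The shorthand classes in Table 1-3 can be tried out with a small counting helper. The following is only a sketch: the class name ShorthandClassDemo and the method countMatches are hypothetical names chosen for illustration, and the sample inputs are made up. Note that inside a Java string literal, each regex backslash has to be written as \\.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class ShorthandClassDemo {

    // Counts how many separate times the regex matches inside the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \d matches one digit at a time, so 19 yields two matches.
        System.out.println("\\d in 19: " + countMatches("\\d", "19"));
        // \D matches anything that isn't a digit, including the space.
        System.out.println("\\D in 'a 1': " + countMatches("\\D", "a 1"));
        // \w covers letters, digits, and the underscore.
        System.out.println("\\w in 'a_1 !': " + countMatches("\\w", "a_1 !"));
        // \W covers everything else, here the space and the exclamation mark.
        System.out.println("\\W in 'a_1 !': " + countMatches("\\W", "a_1 !"));
        // \s matches the tab, the space, and the newline in this input.
        System.out.println("\\s count: " + countMatches("\\s", "a\tb c\n"));
    }
}
```

Counting with Matcher.find rather than testing the whole string makes the "19 will need to match twice" remark in the table directly observable.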
Common Characters Example
Imagine that you need to verify that a given String consists of any alphanumeric character, including underscores, followed by a digit. Thus, you would accept A1, but not !1, because the ! symbol isn’t an alphanumeric character or an underscore. The pattern you want in this case consists of an alphanumeric character (or underscore) followed by a digit; thus, \w\d, per Table 1-3.
The pattern \w\d will match h1, k9, A1, or l1, because each consists of an alphanumeric character followed by a digit. It won’t match AA, 9A, or *5, because these don’t consist of an alphanumeric character followed by a digit. Table 1-4 dissects the pattern.
Table 1-4. The Pattern \w\d
Regex | Description |
\w | Any character ranging from a to z, A to Z, 0 to 9, or an underscore |
\d | Followed by a single digit ranging from 0 to 9 |
* In English: Look for any alphanumeric character, or the underscore character, followed by a single digit. |
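As a rough illustration of the \w\d example, the check could be written in Java along the following lines. This is a minimal sketch, assuming an exact (whole-string) match is wanted; the class name WordDigitDemo and the method isWordThenDigit are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the \w\d pattern.
class WordDigitDemo {

    // One word character followed by one digit; backslashes are doubled in Java source.
    static final Pattern WORD_THEN_DIGIT = Pattern.compile("\\w\\d");

    // True only if the entire input is a word character followed by a digit.
    static boolean isWordThenDigit(String input) {
        return WORD_THEN_DIGIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"h1", "k9", "A1", "l1", "AA", "9A", "*5", "!1"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isWordThenDigit(candidate));
        }
    }
}
```

The first four candidates should be accepted and the last four rejected, mirroring the examples in the text.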
Boundary Characters
Regular expressions also provide a mechanism for finding common character boundaries. These include newlines, end-of-line characters, end-of-file characters, tabs, and so on. These are listed in the latter part of Table 1-3.
Boundary Characters Example
Say you want to match the word anna from an input string, but only if it’s at the beginning of a word. Thus, Hanna wouldn’t fit your criteria. The pattern you want in this case consists of a word boundary, \b, followed by the characters a, n, n, and a, thus the regex \banna.
The pattern \banna will match anna but not Hanna, because the first a in anna sits at a word boundary: it is preceded by a space character or by the start of the input rather than by another word character. This isn’t true of Hanna, because the character immediately preceding the a in Hanna is an H, which is itself a word character, so no boundary falls between them. Table 1-5 dissects the pattern.
Table 1-5. The Pattern \banna
Regex | Description |
\b | A word boundary |
a | Followed by the character a |
n | Followed by the character n |
n | Followed by the character n |
a | Followed by the character a |
* In English: Look for anna if it is the beginning of a word. |
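The \banna example can be sketched in Java as follows. The sketch assumes a partial match is wanted, so it uses Matcher.find to search anywhere in the input; the class name WordBoundaryDemo, the method name, and the sample sentences are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the \banna pattern.
class WordBoundaryDemo {

    // A word boundary followed by the literal characters a, n, n, a.
    static final Pattern ANNA_AT_WORD_START = Pattern.compile("\\banna");

    // True if anna occurs somewhere in the input at the beginning of a word.
    static boolean containsAnnaAtWordStart(String input) {
        return ANNA_AT_WORD_START.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] candidates = {"anna", "my friend anna", "Hanna", "joanna"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + containsAnnaAtWordStart(candidate));
        }
    }
}
```

In Hanna and joanna, the letters anna are preceded by another word character, so no word boundary is available and the search fails.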
Quantifiers and Alternates
Quantifiers and alternates allow you to specify the number of tokens you need to find or alternative tokens you’re willing to accept. Table 1-6 lists some of the quantifiers and alternates in regex.
Table 1-6. Quantifiers
Regex | Description |
? | The preceding is repeated once or not at all. |
* | The preceding is repeated zero or more times. |
+ | The preceding is repeated one or more times. |
{n} | The preceding is repeated exactly n times. |
{n,} | The preceding is repeated at least n times. |
{n,m} | The preceding is repeated at least n times, but no more than m times. This includes m repetitions. |
| | The element preceding the | or the element following it. |
The following sections offer some examples that demonstrate working with quantifiers.
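Before those examples, the quantifiers in Table 1-6 can be compared side by side with a short Java sketch. The patterns ab?c, ab*c, ab+c, and ab{2,}c are made-up illustrations rather than examples from the text, and QuantifierDemo and matchesWhole are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the basic quantifiers.
class QuantifierDemo {

    // True only if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] regexes = {"ab?c", "ab*c", "ab+c", "ab{2,}c"};
        String[] inputs = {"ac", "abc", "abbc", "abbbc"};
        for (String regex : regexes) {
            for (String input : inputs) {
                System.out.println(regex + " vs " + input + " -> " + matchesWhole(regex, input));
            }
        }
    }
}
```

Each quantifier applies only to the element immediately before it, here the single character b.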
Repeated Characters Example 1
The pattern An+a will match Ana, Anna, or Annnna because each contains at least one A character immediately followed by one or more n characters followed by an a character. It won’t match Aa or ANna because these don’t consist of a single A character immediately followed by at least one n character followed by an a character. Notice that a capital N and a lowercase n aren’t considered matches. Table 1-7 dissects the pattern.
Table 1-7. The Pattern An+a
Regex | Description |
A | The character A |
n+ | Followed by one or more n characters |
a | Followed by the character a |
* In English: Look for a capital A, followed by one or more n characters, followed by an a character. |
There is some interesting behavior that can be elicited here. If this match had been performed using the String.matches method, the pattern would not have matched AnnaMarie, because the String.matches method requires an exact match, and the Marie part of AnnaMarie would have ruined that exactness. However, the Matcher.find method would have matched AnnaMarie because it’s more permissive. Stay tuned—more details coming soon.
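The difference between the two methods can be seen in a short Java sketch. It is only an illustration of the behavior described above; the class name MatchesVersusFindDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting exact and partial matching for An+a.
class MatchesVersusFindDemo {

    static final Pattern AN_PLUS_A = Pattern.compile("An+a");

    // Exact match: the whole input must fit the pattern, as with String.matches.
    static boolean matchesWhole(String input) {
        return AN_PLUS_A.matcher(input).matches();
    }

    // Partial match: the pattern may occur anywhere inside the input.
    static boolean findsAnywhere(String input) {
        return AN_PLUS_A.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] candidates = {"Ana", "Anna", "Annnna", "Aa", "ANna", "AnnaMarie"};
        for (String candidate : candidates) {
            System.out.println(candidate
                    + " exact=" + matchesWhole(candidate)
                    + " partial=" + findsAnywhere(candidate));
        }
        // String.matches behaves like the exact check.
        System.out.println("AnnaMarie".matches("An+a"));
    }
}
```

For AnnaMarie, the exact check should fail while the partial check succeeds on the leading Anna.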
Repeated Characters Example 2
The pattern A{2,7} will match AA, AAAA, or AAAAAAA because each of these contains at least two A characters and no more than seven A characters. The pattern won’t match A because it contains fewer than two A characters, and the pattern won’t match AAAAAAAA because it contains more than seven A characters. Table 1-8 dissects the pattern.
Table 1-8. The Pattern A{2,7}
Regex | Description |
A | The character A |
{ | Open repeating group |
2 | Repeated at least two times |
, | But not more than |
7 | Seven times |
} | Close repeated group |
* In English: Look for a sequence of the character A repeated two, three, four, five, six, or seven times. |
NOTE In the example at the beginning of this chapter, you needed a pattern to match four consecutive digits and derived \d\d\d\d. As noted, this isn’t the most elegant pattern possible. An alternative, yet equivalent, way of expressing the same pattern is \d{4}, per Table 1-6—that is, a sequence of exactly four digits.
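Both the A{2,7} example and the \d{4} note can be checked with a small Java sketch. It assumes exact matching throughout; RepetitionRangeDemo and its methods are hypothetical names, and repeatA is just a helper that builds the test strings so the A characters don't have to be counted by hand.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for bounded repetition.
class RepetitionRangeDemo {

    static final Pattern TWO_TO_SEVEN_AS = Pattern.compile("A{2,7}");
    static final Pattern FOUR_DIGITS_SHORT = Pattern.compile("\\d{4}");
    static final Pattern FOUR_DIGITS_LONG = Pattern.compile("\\d\\d\\d\\d");

    // Builds a string made of the given number of A characters.
    static String repeatA(int count) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < count; i++) {
            builder.append('A');
        }
        return builder.toString();
    }

    // True only if the entire input is two to seven A characters.
    static boolean isTwoToSevenAs(String input) {
        return TWO_TO_SEVEN_AS.matcher(input).matches();
    }

    // Exact match against the compact form \d{4}.
    static boolean isFourDigitsShort(String input) {
        return FOUR_DIGITS_SHORT.matcher(input).matches();
    }

    // Exact match against the spelled-out form \d\d\d\d.
    static boolean isFourDigitsLong(String input) {
        return FOUR_DIGITS_LONG.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (int count = 1; count <= 8; count++) {
            System.out.println(count + " A characters -> " + isTwoToSevenAs(repeatA(count)));
        }
        for (String input : new String[] {"123", "1234", "12345"}) {
            System.out.println(input + " -> " + isFourDigitsShort(input)
                    + " / " + isFourDigitsLong(input));
        }
    }
}
```

The two four-digit patterns should agree on every input, which is the sense in which the note calls them equivalent.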
Alternative Characters Example 1
The pattern A|B will match A or B, because each consists of either an A character or a B character. It won’t match P, Q, or jelly because these don’t consist strictly of either an A or a B character. Table 1-9 dissects this pattern.
Table 1-9. The Pattern A|B
Regex | Description |
A | The character A |
| | Or |
B | The character B |
* In English: Look for either a capital A or a capital B. |
Alternative Characters Example 2
The pattern anna|marie will match anna or marie, because anna matches the first alternative and marie matches the second. It won’t match Josie, Ralph, or Doctor. Table 1-10 dissects the pattern.
Table 1-10. The Pattern anna|marie
Regex | Description |
anna | The characters a, n, n, and a, in order |
| | Or |
marie | The characters m, a, r, i, and e, in order |
* In English: Look for either the word anna or the word marie. |
So would the pattern match annamarie as a single word? In a word, maybe. I provide detailed information about this topic in later chapters, but here’s the nickel tour. Java 2 Standard Edition’s (J2SE’s) regex allows you to specify whether you need an exact or partial match. Thus, annamarie would match the pattern anna|marie twice for a partial match, and not at all for an exact match. Without going into too much detail, String.matches only provides for exact matches, whereas the Matcher class can provide more lenient matches using the find method.
What about the pattern Miss anna|marie? Will it match Miss marie and Miss anna, or just one of them? Or will it match neither? A strict match will match Miss anna but reject Miss marie. The alternation | reads Miss anna as one option and marie as the other. Because the candidate Miss marie isn’t equal to either Miss anna or marie, the search will reject Miss marie.
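The alternation examples above can be reproduced with a short Java sketch. AlternationDemo, matchesWhole, and countFinds are hypothetical names. The final pattern, Miss (anna|marie), uses parentheses to group the alternatives; grouping isn't covered in this section and is shown here only as an assumption about how one might accept both names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for alternation with |.
class AlternationDemo {

    // Exact match: the whole input must fit one of the alternatives.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Partial matching: counts how many times the regex is found in the input.
    static int countFinds(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("A|B", "A"));
        System.out.println(matchesWhole("A|B", "jelly"));
        // Exact match fails on annamarie, but a partial search finds both names.
        System.out.println(matchesWhole("anna|marie", "annamarie"));
        System.out.println(countFinds("anna|marie", "annamarie"));
        // The alternatives here are "Miss anna" and "marie".
        System.out.println(matchesWhole("Miss anna|marie", "Miss anna"));
        System.out.println(matchesWhole("Miss anna|marie", "Miss marie"));
        // With grouping, the alternatives are just "anna" and "marie" after "Miss ".
        System.out.println(matchesWhole("Miss (anna|marie)", "Miss marie"));
    }
}
```

The ungrouped pattern rejects Miss marie because | has low precedence and splits the entire expression in two.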
This article is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).