Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
Regular Expressions - Back References
Back references are one of the most powerful features offered by regular expressions. Unfortunately, programmers often skip over them because they’re not explained well in the regular expression literature. That’s a mistake I hope to rectify here.
Back references allow a pattern to refer back to parts of itself. They always refer to groups, that is, to parts of the pattern enclosed by the “(” and “)” characters. Table 1-17 presents the syntax for back references.
Table 1-17. Back References
Regex    Description
\1       The first group in the pattern
\2       The second group in the pattern
\n       The nth group in the pattern
NOTE There are some idiosyncratic behaviors associated with how back references work in Java, which I explain later in this chapter and in Chapter 3. For right now, you have enough information on back references to get started.
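To make the syntax in Table 1-17 concrete, here is a minimal sketch (not taken from the book) that uses both \1 and \2 in one pattern. The class name BackReferenceSketch, the method firstMirrored, and the pattern (\w)(\w)\2\1 are assumptions chosen for illustration. The pattern matches four word characters that read the same forward and backward, such as abba. Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceSketch {
    // Group 1 and group 2 each capture one word character;
    // \2 then \1 require those same characters again, in reverse order.
    static final Pattern MIRRORED = Pattern.compile("(\\w)(\\w)\\2\\1");

    // Returns the first mirrored four-character run, or null if none is found.
    static String firstMirrored(String input) {
        Matcher m = MIRRORED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMirrored("abba"));           // abba
        System.out.println(firstMirrored("a noon meeting")); // noon
        System.out.println(firstMirrored("abab"));           // null
    }
}
```

The string abab is rejected because the third character must equal whatever group 2 captured (b), not merely be another word character. That is the difference between a back reference and simply repeating \w.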
Back References Example
Say you need to find matches in which a word is duplicated. That is, you don’t know what the word you’re looking for is, but you want to be alerted when the same word appears twice in a row. If you’ve used a word processor such as Microsoft Word, you may have noticed that the application does this automatically. Let’s explore how you might do this in Java.
You’ll use the pattern \b(\w+) \1\b, which is dissected in Table 1-18. This pattern matches pizza pizza, Faster pussycat kill kill, or Never Never Never Never Never because each contains a word that’s immediately repeated. It won’t match 222 2222, sara sarah, or Faster pussycat kill, kill because these don’t contain a word that’s immediately repeated. These three fail for different reasons: 222 2222 has a lingering 2 in the second set, sara sarah has a lingering h in the second word, and in Faster pussycat kill, kill the second kill is separated from the first by a comma.
Table 1-18. The Pattern \b(\w+) \1\b
Regex      Description
\b         A word boundary
(          Followed by a group consisting of
\w         Any alphanumeric character or underscore
+          Repeated one or more times
)          Close group
<space>    Followed by a space
\1         Followed by the exact group of characters captured previously
\b         Followed by a word boundary
In English: Look for a word boundary, followed by a group of alphanumeric characters, followed by a space, followed by the exact same group of alphanumeric characters found previously, followed by a word boundary. In short, look for duplicate words.
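As a quick check of the claims above, the following sketch runs the pattern against the six sample strings. It is an illustration rather than the book’s own listing: the class name DuplicateWordSketch and the helper findDuplicates are hypothetical, and the code assumes Java 7 or later for the diamond operator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordSketch {
    // The pattern \b(\w+) \1\b, with backslashes doubled for the Java string literal.
    static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+) \\1\\b");

    // Collects the captured word from each non-overlapping match in the text.
    static List<String> findDuplicates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the word that \1 had to repeat
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "pizza pizza",
            "Faster pussycat kill kill",
            "Never Never Never Never Never",
            "222 2222",
            "sara sarah",
            "Faster pussycat kill, kill"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + findDuplicates(sample));
        }
    }
}
```

The first three samples produce at least one match and the last three produce an empty list. Never Never Never Never Never yields two matches rather than four, because each successful match consumes a pair of words and the search resumes after that pair.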
In the next section, you’ll examine some practical examples with corresponding Java code.