Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).
Regular Expressions - Confirming Name Formats Example
The code in Listing 1-5 determines whether the given name is well formatted. It looks for a first name token, an optional middle name token, and finally a last name token. For this example’s purposes, a name token consists of a capital letter followed by one or more lowercase letters.
This example is interesting because it takes advantage of Java’s robustness to a degree that the previous example didn’t. Specifically, you define what you mean when you say a “name token”:
String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";
Then you use that definition later:
String namePattern = "("+nameToken+"){2,3}";
NOTE \p{Upper} and \p{Lower} are described shortly. They simply mean any uppercase character and any lowercase character, respectively.
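As a quick illustration of the two classes named in the note, the following sketch checks a few single characters against each one. The class name CharClassDemo and its two helper methods are hypothetical, introduced only for this illustration. Note that, by default, java.util.regex treats \p{Upper} and \p{Lower} as US-ASCII classes, so the sketch uses ASCII input only.

```java
// Hypothetical demo class; it is not one of the book's listings.
class CharClassDemo {

    // True if the whole string is exactly one uppercase ASCII letter.
    static boolean isUpper(String s) {
        return s.matches("\\p{Upper}");
    }

    // True if the whole string is exactly one lowercase ASCII letter.
    static boolean isLower(String s) {
        return s.matches("\\p{Lower}");
    }

    public static void main(String[] args) {
        System.out.println(isUpper("J")); // true
        System.out.println(isUpper("j")); // false
        System.out.println(isLower("j")); // true
        System.out.println(isLower("7")); // false
    }
}
```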
This helps to keep the regex pattern from becoming overwhelming, and it also helps to isolate errors. As the examples in this book grow more ambitious, you’ll start to see that coupling regular expressions with Java’s powerful language can offer benefits that would, at best, be terse using regular expressions alone. Listing 1-5 shows the program MatchNameFormats.java, Output 1-5 shows the result of running the program, and Table 1-22 dissects the pattern.
Listing 1-5. MatchNameFormats.java

    import java.util.regex.*;
    import java.io.*;

    public class MatchNameFormats{
        public static void main(String args[]){
            isNameValid(args[0]);
        }

        /**
         * Confirms that the format for the given name is valid.
         * @param name is a String representing the name.
         * @return true if the name format is acceptable.
         */
        public static boolean isNameValid(String name){
            boolean retval=false;
            String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";
            String namePattern = "("+nameToken+"){2,3}";
            retval = name.matches(namePattern);

            //prepare a message indicating success or failure
            String msg = "NO MATCH: pattern:" + name +
                         "\r\n regex :" + namePattern;
            if (retval){
                msg = "MATCH pattern:" + name +
                      "\r\n regex :" + namePattern;
            }
            System.out.println(msg +"\r\n");
            return retval;
        }
    }
Output 1-5. Result of Running MatchNameFormats.java
    ------------------------------------------------------------------
    C:\RegEx\Examples\chapter1>java MatchNameFormats "John Smith"
    MATCH pattern:John Smith
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

    C:\RegEx\Examples\chapter1>java MatchNameFormats "John McGee"
    MATCH pattern:John McGee
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

    C:\RegEx\Examples\chapter1>java MatchNameFormats "John Willliam Smith"
    MATCH pattern:John Willliam Smith
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

    C:\RegEx\Examples\chapter1>java MatchNameFormats "John Q Smith"
    NO MATCH: pattern:John Q Smith
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

    C:\RegEx\Examples\chapter1>java MatchNameFormats "John allen Smith"
    NO MATCH: pattern:John allen Smith
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}

    C:\RegEx\Examples\chapter1>java MatchNameFormats "John"
    NO MATCH: pattern:John
     regex :(\p{Upper}(\p{Lower}+\s?)){2,3}
Table 1-22. The Pattern (\p{Upper}(\p{Lower}+\s?)){2,3}

| Regex | Description |
| --- | --- |
| ( | A group consisting of |
| \p{Upper} | An uppercase character |
| ( | Followed by an inner group consisting of |
| \p{Lower} | A lowercase character |
| + | Repeated one or more times |
| \s? | Followed by an optional whitespace character (such as a space) |
| ) | The end of the inner group |
| ) | The end of the outer group |
| { | Repeated at least |
| 2 | Two times |
| , | But no more than |
| 3 | Three times |
| } | End repetition |

* In English: Look for two or three words, each beginning with a capital letter followed by one or more lowercase letters. Each word may be followed by a single whitespace character, such as a space.
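To see the {2,3} bound from the table in action, here is a minimal sketch that feeds the same pattern inputs containing one, two, three, and four name tokens. The class name TokenCountDemo and the four-token input are assumptions made for this illustration; the pattern strings are copied from Listing 1-5.

```java
// Hypothetical demo class; the pattern strings come from Listing 1-5.
class TokenCountDemo {
    static final String NAME_TOKEN = "\\p{Upper}(\\p{Lower}+\\s?)";
    static final String NAME_PATTERN = "(" + NAME_TOKEN + "){2,3}";

    // String.matches succeeds only if the entire input fits the pattern.
    static boolean isNameValid(String name) {
        return name.matches(NAME_PATTERN);
    }

    public static void main(String[] args) {
        String[] samples = {
            "John",                 // one token: too few
            "John Smith",           // two tokens
            "John Allen Smith",     // three tokens
            "John Allen Bob Smith"  // four tokens: too many
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isNameValid(sample));
        }
    }
}
```

Because String.matches must consume the whole input, the fourth token cannot simply be ignored, so the four-token name is rejected just as the one-token name is.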
A couple of questions naturally arise from this example:
Why did John Q Smith fail? Because Q is not a name token, as you’ve defined name tokens (i.e., a capital letter followed by one or more lowercase letters).
Why did John allen Smith fail? Because allen doesn’t start with a capital letter.
Why did John fail? Although John is a valid name token, the pattern requires two or three name tokens, and John is only one.
Why did John McGee pass? McGee isn’t an uppercase letter followed by one or more lowercase letters. Try to puzzle this one out on your own. It’s answered in the “FAQs” section at the end of the chapter.
This example uses the composition technique mentioned at the beginning of this chapter. That is, it uses previously defined patterns to compose a new pattern. If you think about it, this is a very engineer-like thing to do: Build small blocks, then use those blocks to build more complicated pieces.
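One way to carry the composition idea a step further is to keep the building blocks as named constants and compile the composed pattern once. The sketch below is an assumption about how that might look, not code from the book: the class name ComposedNamePattern and its constants are hypothetical, and it swaps the String.matches call of Listing 1-5 for a precompiled java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

// Hypothetical class; the pattern strings are copied from Listing 1-5.
class ComposedNamePattern {
    // Small building block: one capital letter, then lowercase letters,
    // then an optional whitespace character.
    static final String NAME_TOKEN = "\\p{Upper}(\\p{Lower}+\\s?)";

    // Larger piece composed from the block above.
    static final String NAME_PATTERN = "(" + NAME_TOKEN + "){2,3}";

    // Compiled once and reused for every call.
    static final Pattern NAME = Pattern.compile(NAME_PATTERN);

    static boolean isNameValid(String name) {
        // Matcher.matches, like String.matches, requires a whole-input match.
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNameValid("John Smith"));       // true
        System.out.println(isNameValid("John Q Smith"));     // false
        System.out.println(isNameValid("John allen Smith")); // false
    }
}
```

String.matches is documented as equivalent to Pattern.matches, which compiles the regex on each call, so compiling once is a reasonable choice when the same pattern is checked against many inputs.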
Confirming Addresses Example
The code in Listing 1-6 simply determines if the given address meets the criterion of being well formatted. It takes advantage of the name and zip code patterns created earlier, and it adds its own address pattern. Output 1-6 shows the result of running the program. Table 1-23 dissects the pattern.
Listing 1-6. MatchAddress.java

    import java.util.regex.*;
    import java.io.*;

    public class MatchAddress{
        public static void main(String args[]){
            isAddressValid(args[0]);
        }

        /**
         * Confirms that the format for the given address is valid.
         * @param addr is a String representing the address
         * @return true if the address format is acceptable.
         */
        public static boolean isAddressValid(String addr){
            boolean retval = false;

            //use the name pattern created earlier.
            String nameToken ="\\p{Upper}(\\p{Lower}+\\s?)";
            String namePattern = "("+nameToken+"){2,3}";

            //use the zip code pattern created earlier.
            String zipCodePattern = "\\d{5}(-\\d{4})?";

            //construct an address pattern
            String addressPattern = "^" + namePattern +
                "\\w+ .*, \\w+ " + zipCodePattern +"$";
            retval= addr.matches(addressPattern);

            //prepare a message indicating success or failure
            String msg = "NO MATCH\npattern:\n" + addr +
                         "\nregexLength:\n " + addressPattern;
            if (retval){
                msg = "MATCH\npattern:\n" + addr +
                      "\nregexLength:\n " + addressPattern;
            }
            System.out.println(msg +"\r\n");
            return retval;
        }
    }
Output 1-6. Result of Running MatchAddress.java
    ------------------------------------------------------------------
    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "John Smith 888 Luck Street, NY 64332"
    MATCH
    pattern:
    John Smith 888 Luck Street, NY 64332
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "John A. Smith 888 Luck Street, NY 64332-4453"
    NO MATCH
    pattern:
    John A. Smith 888 Luck Street, NY 64332-4453
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "John Allen Smith 888 Luck Street, NY 64332-4453"
    MATCH
    pattern:
    John Allen Smith 888 Luck Street, NY 64332-4453
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "888 Luck Street, NY 64332"
    NO MATCH
    pattern:
    888 Luck Street, NY 64332
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "P.O. BOX 888 Luck Street, NY 64332-4453"
    NO MATCH
    pattern:
    P.O. BOX 888 Luck Street, NY 64332-4453
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$

    C:\RegEx\chapter_1\Examples\chapter1> java MatchAddress "John Allen Smith 888 Luck st., NY"
    NO MATCH
    pattern:
    John Allen Smith 888 Luck st., NY
    regexLength:
     ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$
    ------------------------------------------------------------------

Table 1-23. The Pattern ^(\p{Upper}(\p{Lower}+\s?)){2,3}\w+ .*, \w+ \d{5}(-\d{4})?$
In English: Look for a name, as previously defined, followed by a word, any further text ending in a comma, a single word, and finally a zip code. This example uses the composition technique.
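The composed address pattern can be exercised against the six inputs from Output 1-6 with a short sketch such as the one below. The class name AddressPatternDemo and its constants are hypothetical. As a design choice for this sketch only, the ^ and $ anchors used in Listing 1-6 are left out, because String.matches already requires the whole input to match; the results should be the same either way.

```java
// Hypothetical demo class; pattern pieces are copied from Listing 1-6.
class AddressPatternDemo {
    static final String NAME_TOKEN = "\\p{Upper}(\\p{Lower}+\\s?)";
    static final String NAME_PATTERN = "(" + NAME_TOKEN + "){2,3}";
    static final String ZIP_CODE_PATTERN = "\\d{5}(-\\d{4})?";

    // No ^ or $ here: String.matches only succeeds on a whole-input match.
    static final String ADDRESS_PATTERN =
        NAME_PATTERN + "\\w+ .*, \\w+ " + ZIP_CODE_PATTERN;

    static boolean isAddressValid(String addr) {
        return addr.matches(ADDRESS_PATTERN);
    }

    public static void main(String[] args) {
        // The six inputs shown in Output 1-6.
        String[] samples = {
            "John Smith 888 Luck Street, NY 64332",
            "John A. Smith 888 Luck Street, NY 64332-4453",
            "John Allen Smith 888 Luck Street, NY 64332-4453",
            "888 Luck Street, NY 64332",
            "P.O. BOX 888 Luck Street, NY 64332-4453",
            "John Allen Smith 888 Luck st., NY"
        };
        for (String sample : samples) {
            String verdict = isAddressValid(sample) ? "MATCH" : "NO MATCH";
            System.out.println(verdict + ": " + sample);
        }
    }
}
```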