
Regular Expressions


Regular expressions are a mechanism for telling the Java Virtual Machine (JVM) how to find and manipulate text for you. Using regular expressions to do this is different from the traditional approach. This article compares the two approaches. It is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 28, 2005

Regular Expressions - Common and Boundary Characters

Regular expressions also contain characters that take on special meaning when they're preceded by the \ character. These facilitate finding common tokens, such as word boundaries, empty spaces, tabs, alphanumeric characters, and so on. For example, \n and \t are special characters that represent a newline and a tab, respectively.

In this section, I cover these common and boundary characters and provide examples of their use.

Common Characters

Certain types of characters occur often enough that regular expression languages have developed a shorthand for referring to them. For example, a digit is designated by the \d expression. Without the \ character preceding the d, the expression would simply refer to the fourth letter of the English alphabet, in lowercase. Table 1-3 lists some of these common characters.

Table 1-3. Common and Boundary Characters

Character   Description
.           Matches any character; also matches line terminators when the DOTALL flag is set.
\d          A digit [0-9]. This will match any single digit from 0 to 9. Notice that an input of 19 will need to match twice: once for the 1 and once again for the 9.
\D          A nondigit [^0-9]. This will match any character that isn't a digit, including a whitespace character.
\w          A word character [a-zA-Z_0-9]. This will match any character from a to z or A to Z, an underscore, or any single digit from 0 to 9.
\W          A nonword character [^\w]. This will match any character that isn't a word character, such as a punctuation mark, including whitespace characters.
\t          The tab character.
\n          The newline (linefeed) character.
\r          The carriage-return character.
\f          The form-feed character.
\s          A whitespace character. This includes the space, tab, newline, carriage-return, and form-feed characters.
\S          A non-whitespace character, also known as [^\s]. This will match any character that isn't a whitespace character, as described previously.
^           The beginning of a line. By default this is the beginning of the whole input; with the MULTILINE flag it is the beginning of every line.
$           The end of a line. By default this is the end of the whole input; with the MULTILINE flag it is the end of every line.
\b          A word boundary. A word boundary is the position between a word character (\w, as described previously) and a nonword character, or between a word character and the beginning or end of the input. It matches a position rather than a character, so it consumes nothing. Most often, this position sits next to a space, a tab, an end of a line, or a beginning of a line.
\B          A non-word boundary: any position that isn't a word boundary.
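To make the table a little more concrete, here is a minimal sketch that checks a few single-character inputs against some of the shorthands. The class name ShorthandDemo and its helper method are hypothetical names chosen for this illustration; Pattern.matches and Pattern.DOTALL are part of java.util.regex. Note that in Java source code each regex backslash has to be doubled, so \d is written "\\d".

```java
import java.util.regex.Pattern;

class ShorthandDemo {
    // Reports whether the ENTIRE input matches the regex (hypothetical helper).
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Same check, but with DOTALL switched on so that . also accepts line terminators.
    static boolean matchesWholeDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("\\d", "7"));        // true: 7 is a digit
        System.out.println(matchesWhole("\\D", " "));        // true: a space is a nondigit
        System.out.println(matchesWhole("\\w", "_"));        // true: underscore is a word character
        System.out.println(matchesWhole("\\W", "5"));        // false: a digit IS a word character
        System.out.println(matchesWhole("\\s", "\t"));       // true: tab is whitespace
        System.out.println(matchesWhole(".", "\n"));         // false: no DOTALL flag
        System.out.println(matchesWholeDotAll(".", "\n"));   // true: DOTALL flag set
    }
}
```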

Common Characters Example

Imagine that you need to verify that a given String consists of any alphanumeric character, including underscores, followed by a digit. Thus, you would accept A1, but not !1, because the ! symbol isn't an alphanumeric character or an underscore. The pattern you want in this case consists of an alphanumeric character (or underscore) followed by a digit; thus, \w\d, per Table 1-3.

The pattern \w\d will match h1, k9, A1, or l1, because each consists of an alphanumeric character followed by a digit. It won't match AA, 9A, or *5, because these don't consist of an alphanumeric character followed by a digit. Table 1-4 dissects the pattern.

Table 1-4. The Pattern \w\d  

Regex   Description
\w      Any character ranging from a to z, A to Z, 0 to 9, or an underscore
\d      Followed by a single digit ranging from 0 to 9

* In English: Look for any alphanumeric character, or the underscore character, followed by a single digit.
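As a rough sketch of how this check might look in Java, the following runs the candidates from this example through the pattern. The class and method names are hypothetical and exist only for this illustration; the check uses Matcher.matches, which requires the whole String to fit the pattern.

```java
import java.util.regex.Pattern;

class WordDigitDemo {
    // The regex \w\d, with each backslash doubled for the Java string literal.
    static final Pattern WORD_THEN_DIGIT = Pattern.compile("\\w\\d");

    // True only when the whole candidate is one word character followed by one digit.
    static boolean isWordThenDigit(String candidate) {
        return WORD_THEN_DIGIT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"h1", "k9", "A1", "l1", "AA", "9A", "*5", "!1"};
        for (String candidate : candidates) {
            // The first four print true; the last four print false.
            System.out.println(candidate + " -> " + isWordThenDigit(candidate));
        }
    }
}
```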

Boundary Characters

Regular expressions also provide a mechanism for finding common character boundaries. These include newlines, end-of-line characters, end-of-file characters, tabs, and so on. These are listed in the latter part of Table 1-3.

Boundary Characters Example

Say you want to match the word anna from an input string, but only if it's at the beginning of a word. Thus, Hanna wouldn't fit your criteria. The pattern you want in this case consists of a word boundary, \b, followed by the characters a, n, n, and a, thus the regex \banna.

The pattern \banna will match anna but not Hanna. When anna is preceded by a space, or sits at the very beginning of the input, the position just before its first a is a word boundary, because a word character follows something that isn't a word character. This isn't true of Hanna, because the character immediately preceding the a character in Hanna is an H, and H is itself a word character, so there is no boundary between the two. Table 1-5 dissects the pattern.

Table 1-5. The Pattern \banna  

Regex   Description
\b      A word boundary
a       Followed by the character a
n       Followed by the character n
n       Followed by the character n
a       Followed by the character a

* In English: Look for anna if it is the beginning of a word.
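Here is a small sketch of the boundary check, assuming you want to know whether anna begins a word anywhere in the input. The class and method names are hypothetical; Matcher.find is used because the input may contain other text around the word.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // The regex \banna, with the backslash doubled for the Java string literal.
    static final Pattern STARTS_WITH_ANNA = Pattern.compile("\\banna");

    // True when some word in the input begins with the characters anna.
    static boolean hasWordStartingWithAnna(String input) {
        return STARTS_WITH_ANNA.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWordStartingWithAnna("anna"));        // true: start of input is a boundary
        System.out.println(hasWordStartingWithAnna("hello anna"));  // true: boundary after the space
        System.out.println(hasWordStartingWithAnna("Hanna"));       // false: H and a are both word characters
    }
}
```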

Quantifiers and Alternates

Quantifiers and alternates allow you to specify the number of tokens you need to find or alternative tokens you're willing to accept. Table 1-6 lists some of the quantifiers and alternates in regex.

Table 1-6. Quantifiers  

Regex   Description
?       The preceding element is repeated once or not at all.
*       The preceding element is repeated zero or more times.
+       The preceding element is repeated one or more times.
{n}     The preceding element is repeated exactly n times.
{n,}    The preceding element is repeated at least n times.
{n,m}   The preceding element is repeated at least n times, but no more than m times. This includes m repetitions.
|       Either the element preceding the | or the element following it.

The following sections offer some examples that demonstrate working with quantifiers.

Repeated Characters Example 1

The pattern An+a will match Ana, Anna, or Annnna because each consists of a single A character immediately followed by one or more n characters followed by an a character. It won't match Aa or ANna because these don't consist of a single A character immediately followed by at least one n character followed by an a character. Notice that a capital N isn't considered a match for a lowercase n: matching is case sensitive by default. Table 1-7 dissects the pattern.

Table 1-7. The Pattern An+a  

Regex   Description
A       The character A
n+      Followed by one or more n characters
a       Followed by the character a

* In English: Look for a capital A, followed by one or more n characters, followed by an a character.

There is some interesting behavior that can be elicited here. If this match had been performed using the String.matches method, the pattern would not have matched AnnaMarie, because the String.matches method requires an exact match, and the Marie part of AnnaMarie would have ruined that exactness. However, the Matcher.find method would have matched AnnaMarie because it's more permissive: it only needs to find the pattern somewhere in the input, and the Anna at the front qualifies. Stay tuned: more details coming soon.
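The difference between the two methods can be sketched as follows. The class and method names are hypothetical; String.matches, Matcher.find, and Matcher.group are the standard library calls being compared.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactVersusPartialDemo {
    static final Pattern AN_A = Pattern.compile("An+a");

    // Exact match: the whole candidate must fit the pattern.
    static boolean matchesExactly(String candidate) {
        return candidate.matches("An+a");
    }

    // Partial match: returns the first stretch of input that fits the pattern, or null.
    static String firstPartialMatch(String input) {
        Matcher matcher = AN_A.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("Annnna"));        // true
        System.out.println(matchesExactly("ANna"));          // false: capital N is not n
        System.out.println(matchesExactly("AnnaMarie"));     // false: Marie is left over
        System.out.println(firstPartialMatch("AnnaMarie"));  // Anna
    }
}
```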

Repeated Characters Example 2

The pattern A{2,7} will match AA, AAAA, or AAAAAAA because each of these contains at least two A characters and no more than seven A characters. The pattern won't match A because it contains fewer than two A characters, and, when an exact match is required, the pattern won't match AAAAAAAA because it contains more than seven A characters. Table 1-8 dissects the pattern.

Table 1-8. The Pattern A{2,7}  

Regex   Description
A       The character A
{       Open repeating group
2       Repeated at least two times
,       But not more than
7       Seven times
}       Close repeating group

* In English: Look for a sequence of the character A repeated two, three, four, five, six, or seven times.
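A brief sketch of the bounded quantifier in Java follows. The names are hypothetical. It also shows what a partial match does with eight A characters: Matcher.find is satisfied by the first seven, which is why the rejection described above applies to exact matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedRepeatDemo {
    static final Pattern TWO_TO_SEVEN_AS = Pattern.compile("A{2,7}");

    // Exact match: the whole candidate must be two to seven A characters.
    static boolean matchesExactly(String candidate) {
        return TWO_TO_SEVEN_AS.matcher(candidate).matches();
    }

    // Partial match: length of the first run that fits the pattern, or -1 if none.
    static int firstPartialMatchLength(String input) {
        Matcher matcher = TWO_TO_SEVEN_AS.matcher(input);
        return matcher.find() ? matcher.group().length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("AA"));                   // true
        System.out.println(matchesExactly("AAAAAAA"));              // true: seven
        System.out.println(matchesExactly("A"));                    // false: only one
        System.out.println(matchesExactly("AAAAAAAA"));             // false: eight
        System.out.println(firstPartialMatchLength("AAAAAAAA"));    // 7
    }
}
```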

 

NOTE   In the example at the beginning of this chapter, you needed a pattern to match four consecutive digits and derived \d\d\d\d. As noted, this isn't the most elegant pattern possible. An alternative, yet equivalent, way of expressing the same pattern is \d{4}, per Table 1-6: that is, a sequence of exactly four digits.
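If you want to convince yourself that the two spellings agree, a quick sketch along these lines compares them on a handful of inputs. The class name, method names, and sample inputs are made up for this illustration.

```java
import java.util.regex.Pattern;

class FourDigitsDemo {
    static final Pattern LONG_FORM = Pattern.compile("\\d\\d\\d\\d");
    static final Pattern SHORT_FORM = Pattern.compile("\\d{4}");

    // True when the whole candidate is exactly four digits, using the quantifier form.
    static boolean isFourDigits(String candidate) {
        return SHORT_FORM.matcher(candidate).matches();
    }

    // True when both patterns give the same verdict on the candidate.
    static boolean formsAgree(String candidate) {
        return LONG_FORM.matcher(candidate).matches() == SHORT_FORM.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"1234", "123", "12345", "12a4"};
        for (String candidate : candidates) {
            // Only 1234 is accepted; the two forms agree on every candidate.
            System.out.println(candidate + " -> " + isFourDigits(candidate)
                    + ", forms agree: " + formsAgree(candidate));
        }
    }
}
```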

Alternative Characters Example 1  

The pattern A|B will match A or B, because each consists of either an A character or a B character. It won't match P, Q, or jelly because these don't consist strictly of either an A or a B character. Table 1-9 dissects this pattern.

Table 1-9. The Pattern A|B  

Regex   Description
A       The character A
|       Or
B       The character B

* In English: Look for either a capital A or a capital B.

Alternative Characters Example 2

The pattern anna|marie will match anna or marie, because anna matches the first alternative and marie matches the second. It won't match Josie, Ralph, or Doctor. Table 1-10 dissects the pattern.

Table 1-10. The Pattern anna|marie  

Regex   Description
anna    The characters a, n, n, and a, in order
|       Or
marie   The characters m, a, r, i, and e, in order

* In English: Look for either the word anna or the word marie.

So would the pattern match annamarie as a single word? In a word, maybe. I provide detailed information about this topic in later chapters, but here's the nickel tour. Java's regex package (java.util.regex, part of the standard edition since J2SE 1.4) allows you to specify whether you need an exact or partial match. Thus, annamarie would match the pattern anna|marie twice for a partial match, and not at all for an exact match. Without going into too much detail, String.matches only provides for exact matches, whereas the Matcher class can provide more lenient matches using the find method.
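The nickel tour can be sketched in a few lines. The class and method names are hypothetical; the loop simply calls Matcher.find repeatedly and collects each partial match it reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    static final Pattern ANNA_OR_MARIE = Pattern.compile("anna|marie");

    // Exact match: the whole candidate must be anna or marie.
    static boolean matchesExactly(String candidate) {
        return ANNA_OR_MARIE.matcher(candidate).matches();
    }

    // Partial matches: every stretch of the input that is anna or marie, in order.
    static List<String> partialMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ANNA_OR_MARIE.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("anna"));        // true
        System.out.println(matchesExactly("annamarie"));   // false
        System.out.println(partialMatches("annamarie"));   // [anna, marie]
    }
}
```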

What about the pattern Miss anna|marie? Will it match Miss marie and Miss anna, or just one of them? Or will it match neither? A strict match will match Miss anna but reject Miss marie. The alternative | will read Miss anna as a single option and the pattern marie as another. Because the pattern marie isn't equal to the candidate Miss marie, the search will reject Miss marie.
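Here is a sketch of that behavior. The second pattern, Miss (anna|marie), isn't discussed in this section; it's my assumption about what you'd likely want instead, and it relies on ordinary regex grouping with parentheses to limit the alternation to the two names. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AlternationScopeDemo {
    // Read as "Miss anna" OR "marie": the | splits the entire pattern.
    static final Pattern UNGROUPED = Pattern.compile("Miss anna|marie");

    // Assumed alternative: parentheses confine the | to the two names.
    static final Pattern GROUPED = Pattern.compile("Miss (anna|marie)");

    static boolean matchesUngrouped(String candidate) {
        return UNGROUPED.matcher(candidate).matches();
    }

    static boolean matchesGrouped(String candidate) {
        return GROUPED.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesUngrouped("Miss anna"));   // true
        System.out.println(matchesUngrouped("Miss marie"));  // false
        System.out.println(matchesUngrouped("marie"));       // true
        System.out.println(matchesGrouped("Miss anna"));     // true
        System.out.println(matchesGrouped("Miss marie"));    // true
        System.out.println(matchesGrouped("marie"));         // false
    }
}
```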

