Regular Expressions
By: Apress Publishing
    2005-07-28

    Table of Contents:
  • Regular Expressions
  • Creating Patterns
  • Common and Boundary Characters
  • Character Classes
  • Back References
  • Integrating Java with Regular Expressions
  • Confirming Name Formats Example
  • Finding Duplicate Words Example
  • Regular Expression Operations
  • Search and Replace
  • Comparing Regex and Perl


    Regular Expressions - Common and Boundary Characters


    (Page 3 of 11)

    Regular expressions also contain characters that take on special meaning when they’re preceded by the \ character. These make it easier to find common tokens, such as word boundaries, whitespace, tabs, alphanumeric characters, and so on. For example, \n and \t are special characters that represent a newline and a tab, respectively.

    In this section, I cover these common and boundary characters and provide examples of their use.

    Common Characters

    Certain types of characters occur often enough that regular expression languages have developed a shorthand for referring to them. For example, a digit is designated by the \d expression. Without the \ character preceding the d, the expression would simply refer to the fourth letter of the English alphabet, in lowercase. Table 1-3 lists some of these common characters.

    Table 1-3. Common and Boundary Characters

    Character   Description
    .           Matches any character. It matches line terminators only when DOTALL mode is enabled.
    \d          A digit, [0-9]. This will match any single digit from 0 to 9. Notice that an input of 19 will need to match twice: once for the 1 and once again for the 9.
    \D          A nondigit, [^0-9]. This will match any character that isn’t a digit, including a whitespace character.
    \w          A word character, [a-zA-Z_0-9]. This will match any character from a to z or A to Z, an underscore, or any single digit from 0 to 9.
    \W          A nonword character, [^\w]. This will match any character that isn’t a word character, such as a punctuation mark, including whitespace characters.
    \t          The tab character.
    \n          The newline (linefeed) character.
    \r          The carriage-return character.
    \f          The form-feed character.
    \s          A whitespace character. This includes the space, tab, newline, vertical-tab, form-feed, and carriage-return characters.
    \S          A non-whitespace character, also known as [^\s]. This will match any character that isn’t a whitespace character, as described previously.
    ^           The beginning of a line. By default this is the beginning of the entire input; in MULTILINE mode it is the beginning of each line.
    $           The end of a line. By default this is the end of the entire input (or just before a final line terminator); in MULTILINE mode it is the end of each line.
    \b          A word boundary. This is a position rather than a character: the point where a word character (\w, as described previously) sits next to a nonword character or next to the beginning or end of the input. It consumes no characters. Most often, this position falls next to a space, a tab, an end of a line, or a beginning of a line.
    \B          A non-word boundary; that is, any position where \b doesn’t match.
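
    The shorthand classes in Table 1-3 can be tried out with a few lines of java.util.regex code. The following is a minimal sketch, not code from the book: the class name ShorthandClassDemo and the helper countMatches are made up for illustration. It counts how many separate matches a one-character class produces, which makes the "19 will need to match twice" remark concrete. Note that inside a Java string literal each backslash has to be doubled, so the regex \d is written "\\d".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from java.util.regex.
class ShorthandClassDemo {

    // Counts how many separate matches of the regex occur in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\\d", "19"));       // 2: once for 1, once for 9
        System.out.println(countMatches("\\D", "a 1"));      // 2: the letter and the space
        System.out.println(countMatches("\\w", "a_1!"));     // 3: a, _, and 1
        System.out.println(countMatches("\\s", "a\tb c\n")); // 3: tab, space, newline
    }
}
```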

    Common Characters Example

    Imagine that you need to verify that a given String consists of an alphanumeric character or an underscore, followed by a digit. Thus, you would accept A1, but not !1, because the ! symbol isn’t an alphanumeric character or an underscore. The pattern you want in this case consists of an alphanumeric character (or underscore) followed by a digit; thus, \w\d, per Table 1-3.

    The pattern \w\d will match h1, k9, A1, or l1, because each consists of an alphanumeric character followed by a digit. It won’t match AA, 9A, or *5, because these don’t consist of an alphanumeric character followed by a digit. Table 1-4 dissects the pattern.

    Table 1-4. The Pattern \w\d

    Regex   Description
    \w      Any character ranging from a to z, A to Z, 0 to 9, or an underscore
    \d      Followed by a single digit ranging from 0 to 9

    * In English: Look for any alphanumeric character, or the underscore character, followed by a single digit.
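
    As a rough illustration of the check described above, the sketch below runs the pattern \w\d against the candidates named in the text. The class name WordCharThenDigit and the method isValid are hypothetical; Matcher.matches is used on the assumption that the whole String must fit the pattern, as the example implies.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern \w\d is the one from the text.
class WordCharThenDigit {

    // The regex \w\d; each backslash is doubled inside a Java string literal.
    private static final Pattern WORD_CHAR_THEN_DIGIT = Pattern.compile("\\w\\d");

    // True only if the whole candidate is one word character followed by one digit.
    static boolean isValid(String candidate) {
        return WORD_CHAR_THEN_DIGIT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"h1", "k9", "A1", "l1", "AA", "9A", "*5", "!1"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValid(candidate));
        }
        // The first four candidates print true; the last four print false.
    }
}
```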

    Boundary Characters

    Regular expressions also provide a mechanism for finding common boundaries, such as the beginning or end of a line and the edges of words. The boundary matchers ^, $, \b, and \B are listed in the latter part of Table 1-3.

    Boundary Characters Example

    Say you want to match the word anna from an input string, but only if it’s at the beginning of a word. Thus, Hanna wouldn’t fit your criteria. The pattern you want in this case consists of a word boundary, \b, followed by the characters a, n, n, and a, thus the regex \banna.

    The pattern \banna will match anna when it follows, say, a space character, but it won’t match the anna inside Hanna. The position between a space (a nonword character) and the a (a word character) meets the criterion of being a word boundary. This isn’t true of Hanna, because the character immediately preceding the a character in Hanna is an H; both are word characters, so there is no word boundary between them. Table 1-5 dissects the pattern.

    Table 1-5. The Pattern \banna

    Regex   Description
    \b      A word boundary
    a       Followed by the character a
    n       Followed by the character n
    n       Followed by the character n
    a       Followed by the character a

    * In English: Look for anna if it is the beginning of a word.
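
    The behavior of \banna can be sketched with Matcher.find, which searches for the pattern anywhere in the input. The class name WordBoundaryDemo and the method indexOfMatch are illustrative assumptions, not part of the original text. Reporting the start index shows that \b itself consumes nothing: in "Hi anna" the match begins at the a, not at the space before it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the pattern \banna from the text.
class WordBoundaryDemo {

    private static final Pattern ANNA_AT_WORD_START = Pattern.compile("\\banna");

    // Returns the index where the first match starts, or -1 if there is none.
    static int indexOfMatch(String input) {
        Matcher m = ANNA_AT_WORD_START.matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfMatch("anna"));    // 0: the start of the input is a boundary
        System.out.println(indexOfMatch("Hi anna")); // 3: the match starts at a, not at the space
        System.out.println(indexOfMatch("Hanna"));   // -1: no boundary between H and a
    }
}
```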

    Quantifiers and Alternates

    Quantifiers and alternates allow you to specify the number of tokens you need to find or alternative tokens you’re willing to accept. Table 1-6 lists some of the quantifiers and alternates in regex.

    Table 1-6. Quantifiers

    Regex   Description
    ?       The preceding element occurs once or not at all.
    *       The preceding element is repeated zero or more times.
    +       The preceding element is repeated one or more times.
    {n}     The preceding element is repeated exactly n times.
    {n,}    The preceding element is repeated at least n times.
    {n,m}   The preceding element is repeated at least n times, but no more than m times. This includes m repetitions.
    |       Either the element preceding the | or the element following it.

    The following sections offer some examples that demonstrate working with quantifiers.

    Repeated Characters Example 1

    The pattern An+a will match Ana, Anna, or Annnna because each contains an A character immediately followed by one or more n characters followed by an a character. It won’t match Aa or ANna because these don’t consist of a single A character immediately followed by at least one n character followed by an a character. Notice that matching is case sensitive: a capital N doesn’t match a lowercase n. Table 1-7 dissects the pattern.

    Table 1-7. The Pattern An+a

    Regex   Description
    A       The character A
    n+      Followed by one or more n characters
    a       Followed by the character a

    * In English: Look for a capital A, followed by one or more n characters, followed by an a character.

    There is some interesting behavior that can be elicited here. If this match had been performed using the String.matches method, the pattern would not have matched AnnaMarie, because the String.matches method requires an exact match, and the Marie part of AnnaMarie would have ruined that exactness. However, the Matcher.find method would have matched AnnaMarie because it’s more permissive. Stay tuned—more details coming soon.
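
    The difference between the two kinds of match can be seen in a short sketch. The class name MatchesVersusFind and its two helper methods are hypothetical; String.matches, Matcher.matches, and Matcher.find are the standard library calls the text refers to.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting whole-input and partial matching for An+a.
class MatchesVersusFind {

    private static final Pattern AN_PLUS_A = Pattern.compile("An+a");

    // Exact match: the entire input must fit the pattern.
    static boolean matchesWhole(String input) {
        return AN_PLUS_A.matcher(input).matches();
    }

    // Partial match: the pattern may occur anywhere inside the input.
    static boolean foundWithin(String input) {
        return AN_PLUS_A.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Anna"));             // true
        System.out.println(matchesWhole("AnnaMarie"));        // false: Marie is left over
        System.out.println(foundWithin("AnnaMarie"));         // true: Anna is found at the start
        System.out.println(foundWithin("ANna"));              // false: N is not n
        System.out.println("AnnaMarie".matches("An+a"));      // false: String.matches is exact
    }
}
```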

    Repeated Characters Example 2

    The pattern A{2,7} will match AA, AAAA, or AAAAAAA because each of these contains at least two A characters and no more than seven A characters. The pattern won’t match A because it contains fewer than two A characters, and the pattern won’t match AAAAAAAA because it contains more than seven A characters. Table 1-8 dissects the pattern.

    Table 1-8. The Pattern A{2,7}

    Regex   Description
    A       The character A
    {       Open repeating group
    2       Repeated at least two times
    ,       But not more than
    7       Seven times
    }       Close repeating group

    * In English: Look for a sequence of the character A repeated two, three, four, five, six, or seven times.
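
    A small sketch of the A{2,7} example follows; the class name BoundedRepetition and both helper methods are assumptions made for illustration. It also shows a detail worth keeping in mind: the rejection of eight A characters applies to an exact match, whereas a partial match with Matcher.find would still find the first seven.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the pattern A{2,7} from the text.
class BoundedRepetition {

    private static final Pattern TWO_TO_SEVEN_AS = Pattern.compile("A{2,7}");

    // Exact match: the entire input must be two to seven A characters.
    static boolean matchesWhole(String input) {
        return TWO_TO_SEVEN_AS.matcher(input).matches();
    }

    // Returns the text of the first partial match, or null if there is none.
    static String firstFound(String input) {
        Matcher m = TWO_TO_SEVEN_AS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("A"));        // false: fewer than two
        System.out.println(matchesWhole("AA"));       // true
        System.out.println(matchesWhole("AAAAAAA"));  // true: exactly seven
        System.out.println(matchesWhole("AAAAAAAA")); // false: eight is too many
        System.out.println(firstFound("AAAAAAAA"));   // AAAAAAA: the first seven of the eight
    }
}
```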

     

    NOTE   In the example at the beginning of this chapter, you needed a pattern to match four consecutive digits and derived \d\d\d\d. As noted, this isn’t the most elegant pattern possible. An alternative, yet equivalent, way of expressing the same pattern is \d{4}, per Table 1-6—that is, a sequence of exactly four digits.
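
    The equivalence mentioned in the note can be spot-checked, though not proved, with a handful of inputs. In this sketch the class name FourDigits, the method sameVerdict, and the sample inputs are all made up for illustration.

```java
// Hypothetical demo class comparing \d\d\d\d with \d{4} on sample inputs.
class FourDigits {

    // True if both spellings of the pattern agree on whether the input matches exactly.
    static boolean sameVerdict(String input) {
        return input.matches("\\d\\d\\d\\d") == input.matches("\\d{4}");
    }

    public static void main(String[] args) {
        String[] samples = {"1234", "123", "12345", "12a4"};
        for (String sample : samples) {
            System.out.println(sample
                    + " matches \\d{4}: " + sample.matches("\\d{4}")
                    + ", same verdict as \\d\\d\\d\\d: " + sameVerdict(sample));
        }
        // Only 1234 matches; both spellings agree on every sample.
    }
}
```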

    Alternative Characters Example 1  

    The pattern A|B will match A or B, because each consists of either an A character or a B character. It won’t match P, Q, or jelly because these don’t consist strictly of either an A or a B character. Table 1-9 dissects this pattern.

    Table 1-9. The Pattern A|B

    Regex   Description
    A       The character A
    |       Or
    B       The character B

    * In English: Look for either a capital A or a capital B.

    Alternative Characters Example 2

    The pattern anna|marie will match anna or marie, because anna matches the first alternative and marie matches the second. It won’t match Josie, Ralph, or Doctor. Table 1-10 dissects the pattern.

    Table 1-10. The Pattern anna|marie

    Regex   Description
    anna    The characters a, n, n, and a, in order
    |       Or
    marie   The characters m, a, r, i, and e, in order

    * In English: Look for either the word anna or the word marie.

    So would the pattern match annamarie as a single word? In a word, maybe. I provide detailed information about this topic in later chapters, but here’s the nickel tour. Java 2 Standard Edition’s (J2SE’s) regex allows you to specify whether you need an exact or partial match. Thus, annamarie would match the pattern anna|marie twice for a partial match, and not at all for an exact match. Without going into too much detail, String.matches only provides for exact matches, whereas the Matcher class can provide more lenient matches using the find method.
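
    The "twice for a partial match, and not at all for an exact match" claim can be checked with a short sketch. The class name AlternationDemo and the method findAll are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the pattern anna|marie from the text.
class AlternationDemo {

    private static final Pattern ANNA_OR_MARIE = Pattern.compile("anna|marie");

    // Collects every partial match in the input, in the order found.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ANNA_OR_MARIE.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("annamarie".matches("anna|marie")); // false: no exact match
        System.out.println(findAll("annamarie"));              // [anna, marie]: two partial matches
        System.out.println("anna".matches("anna|marie"));      // true
    }
}
```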

    What about the pattern Miss anna|marie? Will it match Miss marie and Miss anna, or just one of them? Or will it match neither? A strict match will match Miss anna but reject Miss marie. The alternative | will read Miss anna as a single option and the pattern marie as another. Because the pattern marie isn’t equal to the candidate Miss marie, the search will reject Miss marie.
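
    The scope of the | operator can be made visible by comparing the pattern from the text with a grouped variant. The grouped pattern Miss (anna|marie) is not part of the original discussion; it is included here only as an assumption about what a reader who wants both names accepted would probably write, using ordinary parentheses for grouping. The class and method names are likewise hypothetical.

```java
// Hypothetical demo class showing how far the | operator reaches.
class AlternationScope {

    // The pattern from the text: either "Miss anna" or "marie".
    static boolean ungrouped(String input) {
        return input.matches("Miss anna|marie");
    }

    // Assumed variant, not from the text: "Miss " followed by either name.
    static boolean grouped(String input) {
        return input.matches("Miss (anna|marie)");
    }

    public static void main(String[] args) {
        System.out.println(ungrouped("Miss anna"));  // true
        System.out.println(ungrouped("Miss marie")); // false
        System.out.println(ungrouped("marie"));      // true: the second alternative alone
        System.out.println(grouped("Miss marie"));   // true
        System.out.println(grouped("marie"));        // false: "Miss " is now required
    }
}
```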



     

    This article is excerpted from Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070). Check it out at your favorite bookstore.
