
Advanced Regex

Have you reached the point in your studies of J2SE where you want to learn about some of the more complex regex tools and concepts? This article introduces a variety of concepts and offers some advice for increasing the efficiency of your regular expressions. It is excerpted from chapter three of Java Regular Expressions: Taming the java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).

Author Info:
By: Apress Publishing
July 07, 2005
  1. · Advanced Regex
  2. · Noncapturing Subgroups
  3. · Greedy Qualifiers
  4. · Reluctant Qualifiers
  5. · Understanding Lookarounds
  6. · Zen and the Art of Efficient Expressions
  7. · Summary


Advanced Regex - Zen and the Art of Efficient Expressions
(Page 6 of 7 )

This section presents some techniques you can use to optimize your regular expressions in J2SE. It's designed to help you deal with regular expressions that already exist, as well as establish a framework for writing new ones. The suggestions in this section, along with your own intuition and an active awareness of the nature of your data, will help you optimize your own regex.

Use Noncapturing Groups Where Possible

When you use capturing groups, the JVM has to keep track of what each group matched. This can be very helpful if you need to extract the groups or if you need back references to them later on in the expression. However, if you're using groups strictly for logical purposes, it's worthwhile to make them noncapturing, as this conserves memory.

The example given earlier for the pattern (?:good|bad|terrible|great) morning groups the various kinds of mornings into a logical unit, but it has no need to capture them. Thus, the group is noncapturing and opens with ?: inside the opening parenthesis.
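As a rough illustration, the following sketch compiles the same alternation with and without capturing and reports how many groups the engine has to track. The class name GroupingDemo and the helper groupCount are made up for this example; only Pattern and Matcher.groupCount() come from java.util.regex.

```java
import java.util.regex.Pattern;

final class GroupingDemo {
    // The same alternation, once capturing and once noncapturing.
    static final Pattern CAPTURING =
            Pattern.compile("(good|bad|terrible|great) morning");
    static final Pattern NONCAPTURING =
            Pattern.compile("(?:good|bad|terrible|great) morning");

    // Hypothetical helper: number of capturing groups the pattern defines.
    static int groupCount(Pattern pattern) {
        return pattern.matcher("").groupCount();
    }

    public static void main(String[] args) {
        String candidate = "Have a great morning";
        System.out.println(groupCount(CAPTURING));                  // prints 1
        System.out.println(groupCount(NONCAPTURING));               // prints 0
        System.out.println(CAPTURING.matcher(candidate).find());    // prints true
        System.out.println(NONCAPTURING.matcher(candidate).find()); // prints true
    }
}
```

Both patterns accept the same candidates; the noncapturing version simply has no group to record.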

Precheck Your Candidate Strings

If you're looking for a specific string, you might save CPU cycles by first making sure that the string in question actually exists in your candidate. For example, say you want to parse a string that might contain an e-mail address, and if it does, you want to extract the domain of the e-mail address. It makes sense to first make sure that the candidate contains an @ symbol, and then begin the regex search. Thus, the following is an inexpensive way to check for the necessity of even compiling your pattern:

if (candidate.indexOf("@") != -1) //..try to extract domain information.
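A fuller version of this precheck might look like the sketch below. The class DomainExtractor, the method extractDomain, and the deliberately simple domain pattern are assumptions made for illustration; real e-mail addresses need a more careful expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainExtractor {
    // Hypothetical, simplified pattern: captures the text that follows the @.
    private static final Pattern DOMAIN = Pattern.compile("@([\\w.-]+)");

    // Returns the domain, or null when the candidate holds no e-mail-like text.
    static String extractDomain(String candidate) {
        if (candidate.indexOf('@') == -1) {
            return null; // cheap precheck failed, so the regex never runs
        }
        Matcher matcher = DOMAIN.matcher(candidate);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // prints example.com
        System.out.println(extractDomain("Write to jane@example.com today"));
    }
}
```

Here the pattern is compiled once when the class loads, so the precheck saves the cost of running the matcher rather than the cost of compiling.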

Offer the Most Likely Alternative First

Say your regular expression is designed to validate the title of members of an all-physicians' health club. You expect 90 percent of your clients to have the title Dr, but some might not. Thus, your title-matching pattern should probably be something along the lines of .*\b(?:Dr|Mr|Mrs|Miss|Brother|Mister).*.

This ordering increases the likelihood of a successful match happening sooner rather than later, thus reducing the number of steps the regex engine has to work through. For example, imagine that the candidate string is Please meet Dr Hana Saez. The way the pattern is currently written, once the engine reaches the title it matches the D, the r, the space, and everything thereafter with the first alternative it tries. Thus, it never attempts Mr, Mrs, and so on at that point.

However, if the pattern had been .*\b(?:Mr|Mrs|Miss|Brother|Mister|Dr) .*, the engine would have tried Mr where the title begins, then Mrs, then Miss, then Brother, and so on until it finally matched Dr. The actual result would have been the same, but the path there would have been more resource intensive.
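The two orderings can be compared side by side with a small sketch like the one below. The class TitleMatcher and its field names are invented for this example, and the code only confirms that both patterns reach the same verdict; it does not measure how much work each ordering costs.

```java
import java.util.regex.Pattern;

final class TitleMatcher {
    // Most likely title (Dr) offered first, as in the text.
    static final Pattern LIKELY_FIRST =
            Pattern.compile(".*\\b(?:Dr|Mr|Mrs|Miss|Brother|Mister).*");
    // Same alternatives with Dr offered last.
    static final Pattern LIKELY_LAST =
            Pattern.compile(".*\\b(?:Mr|Mrs|Miss|Brother|Mister|Dr) .*");

    // True when the whole candidate matches the given pattern.
    static boolean hasTitle(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String candidate = "Please meet Dr Hana Saez";
        System.out.println(hasTitle(LIKELY_FIRST, candidate)); // prints true
        System.out.println(hasTitle(LIKELY_LAST, candidate));  // prints true
    }
}
```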

Be As Specific As Possible

If you know that your regex pattern should only match numbers at a given point, then don't use \w to define that part of the match; use \d. This will allow the regex engine to narrow the scope of its search and filter more quickly. Or, if you know that your pattern must contain a given word, then use that very word in the pattern. This is the same principle that makes it easier for you to write code to specific requirements. The more focused and detailed the requirements, the easier your job becomes.

For example, say you know that your candidate string must start with a capital letter, with lowercase letters following the initial character. It's more efficient to use a pattern such as [A-Z][a-z]+ than it is to use a pattern such as \w*. Both might work, but [A-Z][a-z]+ allows the engine to produce an accurate result faster than \w*, because it can refuse to consider digits and lowercase letters for the first character, and refuse to consider digits and uppercase letters for the latter parts.

The J2SE regex implementation is much faster with literal strings than it is with character classes: it can literally zero in on specific strings. Thus, if you're looking for a number between 10 and 19, it's more efficient to use 1\d than to use \d+, \d\d, \d{2}, or any such variation.
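The difference in what the specific and the general patterns accept can be seen in a short sketch. The class SpecificPatterns and its field names are hypothetical; the example shows only which candidates each pattern accepts, not the relative speed.

```java
import java.util.regex.Pattern;

final class SpecificPatterns {
    // Specific: one capital letter followed by one or more lowercase letters.
    static final Pattern CAPITALIZED = Pattern.compile("[A-Z][a-z]+");
    // General: any run of word characters, including digits and underscores.
    static final Pattern ANY_WORD = Pattern.compile("\\w*");
    // Specific: the literal 1 followed by one digit, that is, 10 through 19.
    static final Pattern TEENS = Pattern.compile("1\\d");
    // General: any two digits.
    static final Pattern TWO_DIGITS = Pattern.compile("\\d\\d");

    public static void main(String[] args) {
        System.out.println(CAPITALIZED.matcher("Beth").matches()); // prints true
        System.out.println(CAPITALIZED.matcher("b3th").matches()); // prints false
        System.out.println(ANY_WORD.matcher("b3th").matches());    // prints true
        System.out.println(TEENS.matcher("27").matches());         // prints false
        System.out.println(TWO_DIGITS.matcher("27").matches());    // prints true
    }
}
```

Besides any speed benefit, the specific patterns reject input that the general ones would let through.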

Specify the Position of Your Match

If you know that your match can only occur after a beginning of line, or right before an end of line, or after punctuation, then say so in your regex. The pattern ^Beth will match faster and more efficiently than the pattern Beth, when you know that it must occur after a newline.

This type of optimization can be particularly powerful because it allows the engine to stop searching after examining only the first two characters of the candidate string. Look for opportunities to take advantage of this sort of thing in your regex.
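A small sketch of an anchored search follows, assuming the intended pattern is ^Beth. In java.util.regex, ^ matches after a line terminator only when the pattern is compiled with Pattern.MULTILINE, so the sketch sets that flag. The class AnchoredSearch and the helper countMatches are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchoredSearch {
    // MULTILINE lets ^ match at the start of every line, not only at index 0.
    static final Pattern LINE_START_BETH =
            Pattern.compile("^Beth", Pattern.MULTILINE);
    static final Pattern ANY_BETH = Pattern.compile("Beth");

    // Hypothetical helper: counts how many times the pattern is found.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Beth called.\nI told Beth.\nBeth left.";
        System.out.println(countMatches(LINE_START_BETH, text)); // prints 2
        System.out.println(countMatches(ANY_BETH, text));        // prints 3
    }
}
```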

Specify the Size of Your Match

If you're looking for a match that must be at least n characters long, then say so in your regex. Or, if you know that your match can't be more than m characters long, say that too.

For example, imagine that you're parsing a large file for references to first names. It's probably reasonable to assume that the names you want contain fewer than 20 characters. So, although you could use a pattern like \b[A-Z][a-z]+, it's probably better to use something like \b[A-Z][a-z]{1,20}, because the engine can abandon any searches that are longer than 20 characters. Or, if you know that your candidate string must be six or more characters, it's better to use \w\w\w\w\w\w+ than \w+.

NOTE: J2SE regex finds specific repetitions much more quickly than quantified ones. Thus, \w\w\w\w\w\w is much faster than \w{6}.
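The size limits described above can be tried out with a sketch along these lines. The class BoundedNames and the method firstName are made-up names, and the code demonstrates only what the bounded patterns accept, not their speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedNames {
    // A capital letter followed by 1 to 20 lowercase letters.
    static final Pattern NAME = Pattern.compile("\\b[A-Z][a-z]{1,20}");
    // At least six word characters, written out as explicit repetitions.
    static final Pattern SIX_OR_MORE = Pattern.compile("\\w\\w\\w\\w\\w\\w+");

    // Hypothetical helper: the first name-like token found, or null.
    static String firstName(String text) {
        Matcher matcher = NAME.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstName("please meet Beth today"));    // prints Beth
        System.out.println(SIX_OR_MORE.matcher("abcde").matches()); // prints false
        System.out.println(SIX_OR_MORE.matcher("abcdef").matches()); // prints true
    }
}
```

Note that without a trailing \b, NAME stops after 21 characters of a longer word rather than rejecting it, so add a closing boundary if overlong words must not match at all.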

Limit the Scope of Your Alternatives

It's generally more efficient to offer small alternatives than to offer large ones, and it's better to offer them later than earlier. Thus, if you have the pattern Good Morning|Good Evening, you would be better served by the pattern Good (?:Morning|Evening). In the latter example, the regex engine doesn't have to make any decisions until after it has established that Good is part of the candidate string.

In the former example, the engine might be obligated to look twice, even if the candidate string doesn't contain the word Good. That is, even if the candidate string is Bad year and can't possibly match, the pattern Good Morning|Good Evening searches twice anyway: once for Good Morning and then again for Good Evening.
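The refactoring can be checked with a brief sketch that confirms the two forms accept the same candidates. The class Greetings and its field names are assumptions for illustration only.

```java
import java.util.regex.Pattern;

final class Greetings {
    // Two large alternatives that repeat the common prefix.
    static final Pattern REPEATED_PREFIX =
            Pattern.compile("Good Morning|Good Evening");
    // Common prefix factored out, with the small alternatives offered later.
    static final Pattern FACTORED =
            Pattern.compile("Good (?:Morning|Evening)");

    // True when the pattern is found anywhere in the candidate.
    static boolean found(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).find();
    }

    public static void main(String[] args) {
        System.out.println(found(REPEATED_PREFIX, "Good Evening, all")); // prints true
        System.out.println(found(FACTORED, "Good Evening, all"));        // prints true
        System.out.println(found(REPEATED_PREFIX, "Bad year"));          // prints false
        System.out.println(found(FACTORED, "Bad year"));                 // prints false
    }
}
```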

© 2003-2019 by Developer Shed. All rights reserved.