Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how to use Java's Web Crawler class and methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005

An Overview of Regular Expression Processing

As the term is used here, a regular expression is a string of characters that describes a character sequence. This general description, called a pattern, can then be used to find matches in other character sequences. Regular expressions can specify wildcard characters, sets of characters, and various quantifiers. Thus, you can specify a regular expression that represents a general form that can match several different specific character sequences. Two classes in the java.util.regex package support regular expression processing: Pattern and Matcher. You use Pattern to define a regular expression. To match the pattern against another sequence, use Matcher.

The Pattern class defines no constructors. Instead, a pattern is created by calling the compile( ) factory method. The form used here is

  static Pattern compile(String pattern, int options)

Here, pattern is the regular expression that you want to use, and options specifies one or more options that affect matching. The option used by Search Crawler is Pattern.CASE_INSENSITIVE, which causes the case of the strings to be ignored. The compile( ) method transforms the string in pattern into a pattern that can be used for pattern matching by the Matcher class. It returns a Pattern object that contains the pattern.
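As a minimal sketch of the call described above, the following compiles a pattern with the Pattern.CASE_INSENSITIVE option and inspects the result. The class name CompileDemo and the helper compileIgnoringCase( ) are hypothetical names chosen for illustration; they are not part of Search Crawler.

```java
import java.util.regex.Pattern;

class CompileDemo {
    // Hypothetical helper: compiles a regular expression so that
    // matching ignores letter case.
    static Pattern compileIgnoringCase(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = compileIgnoringCase("java");

        // pattern() returns the string the Pattern was compiled from.
        System.out.println(p.pattern());

        // flags() returns the options passed to compile().
        boolean ignoresCase = (p.flags() & Pattern.CASE_INSENSITIVE) != 0;
        System.out.println(ignoresCase);
    }
}
```

Several options can be combined with the bitwise OR operator, which is why options is an int rather than a single enumerated value.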

Once you have created a Pattern object, you will use it to create a Matcher. This is done by calling the matcher( ) factory method defined by Pattern. It is shown here:

  Matcher matcher(CharSequence str)

Here, str is the character sequence that the pattern will be matched against. This is called the input sequence. CharSequence is an interface that was added by Java 2, v1.4 and defines a read-only sequence of characters. It is implemented by the String class, among others. Thus, you can pass a string to matcher( ).

You will use methods defined by Matcher to perform various pattern-matching operations. The ones used by retrieveLinks( ) are find( ) and group( ). The find( ) method determines if a subsequence of the input sequence matches the pattern. The version used by Search Crawler is shown here:

  boolean find( )

It returns true if there is a matching subsequence and false otherwise. This method can be called repeatedly, allowing it to find all matching subsequences. Each call to find( ) begins where the previous one left off.
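The repeated-call behavior of find( ) can be sketched as a simple counting loop. This is an illustrative example only: the class FindDemo and the method countMatches( ) are hypothetical names, and the input strings are made up. Because matcher( ) accepts any CharSequence, the second call passes a StringBuilder instead of a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: counts how many subsequences of input
    // match regex, ignoring case.
    static int countMatches(String regex, CharSequence input) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        int count = 0;
        // Each call to find() resumes after the previous match.
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Matches "Java", "JAVA", and the "java" in "javascript".
        System.out.println(countMatches("java", "Java, JAVA, and javascript")); // prints 3

        // A StringBuilder is also a CharSequence.
        System.out.println(countMatches("java", new StringBuilder("no match here"))); // prints 0
    }
}
```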

You can obtain a string containing a matching sequence by calling group( ). The form used by Search Crawler is shown here:

  String group(int which)

Here, which specifies the capturing group (a parenthesized part of the pattern) whose matched text you want, with the first group being 1. Group 0 refers to the entire match. The matching string is returned.
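To illustrate how find( ) and group( ) work together, here is a small sketch that pulls the quoted value out of each href attribute in a fragment of HTML. The pattern is a deliberately simplified assumption for this example and is not the expression that retrieveLinks( ) uses; the class GroupDemo and the method extractHrefs( ) are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns the text inside the quotes of every
    // href="..." found in html. The parentheses in the pattern define
    // capturing group 1.
    static List<String> extractHrefs(String html) {
        Pattern p = Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) {
            // group(1) is the text matched by the first parenthesized part.
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <A HREF=\"b.html\">B</a>";
        System.out.println(extractHrefs(html)); // prints [a.html, b.html]
    }
}
```

Calling m.group(0) inside the loop would instead return the whole matched text, such as href="a.html", including the attribute name and quotes.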

Regular Expression Syntax

The syntax and rules that define a regular expression are similar to those used by Perl 5. Although no single rule is complicated, there are a large number of them, and a complete discussion is beyond the scope of this book. However, a few of the more commonly used constructs are described here.

In general, a regular expression is composed of normal characters, character classes (sets of characters), wildcard characters, and quantifiers. A normal character is matched as is. Thus, if a pattern consists of "xy", the only input sequence that will match it is "xy". Characters such as newlines and tabs are specified using the standard escape sequences, which begin with a backslash (\). For example, a newline is specified by \n. In the language of regular expressions, a normal character is also called a literal.

A character class is a set of characters. A character class is specified by putting the characters in the class between brackets. For example, the class [wxyz] matches w, x, y, or z. To specify an inverted set, precede the characters with a circumflex (^). For example, [^wxyz] matches any character except w, x, y, or z. You can specify a range of characters using a hyphen. For example, to specify a character class that will match the digits 1 through 9, use [1-9].
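The three forms of character class can be checked with a short sketch. The class CharClassDemo and the helper found( ) are hypothetical names used only for this illustration. The last line shows why the hyphen matters in a range: without it, the class lists just the two characters 1 and 9.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true if some part of input matches regex.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[wxyz]", "x"));   // prints true
        System.out.println(found("[wxyz]", "a"));   // prints false

        // Inverted set: anything except w, x, y, or z.
        System.out.println(found("[^wxyz]", "a"));  // prints true
        System.out.println(found("[^wxyz]", "w"));  // prints false

        // Range: the digits 1 through 9.
        System.out.println(found("[1-9]", "5"));    // prints true
        System.out.println(found("[1-9]", "0"));    // prints false

        // No hyphen: only the characters 1 and 9.
        System.out.println(found("[19]", "5"));     // prints false
    }
}
```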

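The wildcard character is the dot (.), and it matches any character. (By default, the dot does not match line terminators such as a newline unless the Pattern.DOTALL option is specified.) Thus, a pattern that consists of "." will match these (and other) input sequences: "A", "a", "x", and so on.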
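A brief sketch of the wildcard follows. The class WildcardDemo is a hypothetical name. It also shows a detail of Java's implementation that is easy to overlook: by default the dot does not match a line terminator, and the Pattern.DOTALL option changes that.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // Hypothetical helper: true if some part of input matches regex
    // when compiled with the given options.
    static boolean found(String regex, int options, String input) {
        return Pattern.compile(regex, options).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(".", 0, "A"));   // prints true
        System.out.println(found(".", 0, "x"));   // prints true

        // There is no character to match in an empty input.
        System.out.println(found(".", 0, ""));    // prints false

        // A newline is a line terminator, which the dot skips by default.
        System.out.println(found(".", 0, "\n"));               // prints false
        System.out.println(found(".", Pattern.DOTALL, "\n"));  // prints true
    }
}
```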

A quantifier determines how many times an expression is matched. The quantifiers are shown here:

+     Match one or more.
*     Match zero or more.
?     Match zero or one.

For example, the pattern "x+" will match "x", "xx", and "xxx", among others.
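The quantifiers can be compared side by side with a small sketch. Note one assumption: this example uses Matcher's matches( ) method, which is not discussed in this chapter. Unlike find( ), it succeeds only if the entire input sequence matches the pattern, which makes the differences between the quantifiers easier to see. The class QuantifierDemo and the helper wholeMatch( ) are hypothetical names.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only if the entire input matches regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // + requires at least one x.
        System.out.println(wholeMatch("x+", "xxx"));  // prints true
        System.out.println(wholeMatch("x+", ""));     // prints false

        // * also accepts zero occurrences.
        System.out.println(wholeMatch("x*", ""));     // prints true
        System.out.println(wholeMatch("x*", "xx"));   // prints true

        // ? accepts zero or one, but not two.
        System.out.println(wholeMatch("x?", "x"));    // prints true
        System.out.println(wholeMatch("x?", "xx"));   // prints false
    }
}
```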
