Are you playing with the possibilities of Java? This article explores in detail how to use Java's Web Crawler class and methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).
Crawling the Web with Java
An Overview of Regular Expression Processing
As the term is used here, a regular expression is a string of characters that describes a character sequence. This general description, called a pattern, can then be used to find matches in other character sequences. Regular expressions can specify wildcard characters, sets of characters, and various quantifiers. Thus, a single regular expression can represent a general form that matches several different specific character sequences. Two classes in the java.util.regex package support regular expression processing: Pattern and Matcher. You use Pattern to define a regular expression. To match the pattern against another sequence, use Matcher.
The Pattern class defines no constructors. Instead, a pattern is created by calling the compile( ) factory method. The form used here is
static Pattern compile(String pattern, int options)
Here, pattern is the regular expression that you want to use, and options specifies one or more options that affect matching. Multiple options are combined with the bitwise OR operator (|). The option used by Search Crawler is Pattern.CASE_INSENSITIVE, which causes matching to ignore letter case. The compile( ) method transforms the string in pattern into a compiled form that the Matcher class can use for pattern matching. It returns a Pattern object that contains the pattern.
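As a minimal sketch of the compile( ) call described above, the following compiles a pattern with Pattern.CASE_INSENSITIVE and then confirms the option through flags( ). The class name CompileSketch, the helper compileIgnoringCase( ), and the sample expression "href" are illustrative assumptions, not code from Search Crawler.

```java
import java.util.regex.Pattern;

class CompileSketch {
    // Hypothetical helper: compiles a regular expression so that letter case is ignored.
    static Pattern compileIgnoringCase(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = compileIgnoringCase("href");

        // flags() reports the options the pattern was compiled with.
        boolean ignoresCase = (p.flags() & Pattern.CASE_INSENSITIVE) != 0;

        // pattern() returns the original regular expression string.
        System.out.println(p.pattern() + " ignores case: " + ignoresCase);
    }
}
```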
Once you have created a Pattern object, you will use it to create a Matcher. This is done by calling the matcher( ) factory method defined by Pattern. It is shown here:
Matcher matcher(CharSequence str)
Here, str is the character sequence that the pattern will be matched against. This is called the input sequence. CharSequence is an interface that was added by Java 2, v1.4 and defines a read-only sequence of characters. It is implemented by the String class, among others. Thus, you can pass a string to matcher( ).
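Because matcher( ) accepts any CharSequence, the input need not be a String. The sketch below passes both a String and a StringBuilder (another CharSequence implementation) to the same pattern. The class MatcherSketch and the helper occursIn( ) are hypothetical names chosen for illustration; the helper relies on find( ), which is described next.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherSketch {
    // Hypothetical helper: reports whether the pattern occurs anywhere in the input sequence.
    static boolean occursIn(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input); // input may be any CharSequence
        return m.find();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("crawler", Pattern.CASE_INSENSITIVE);

        String asString = "Search Crawler";
        StringBuilder asBuilder = new StringBuilder("Search ").append("Crawler");

        System.out.println(occursIn(p, asString));  // true
        System.out.println(occursIn(p, asBuilder)); // true
    }
}
```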
You will use methods defined by Matcher to perform various pattern-matching operations. The ones used by retrieveLinks( ) are find( ) and group( ). The find( ) method determines if a subsequence of the input sequence matches the pattern. The version used by Search Crawler is shown here:
boolean find( )
It returns true if there is a matching subsequence and false otherwise. This method can be called repeatedly, allowing it to find all matching subsequences. Each call to find( ) begins where the previous one left off.
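The repeated-call behavior of find( ) lends itself to a simple loop. The following sketch counts every matching subsequence in an input sequence; the class FindSketch, the helper countMatches( ), and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSketch {
    // Hypothetical helper: counts how many subsequences of input match regex, ignoring case.
    static int countMatches(String regex, CharSequence input) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);

        int count = 0;
        // Each call to find() resumes the search where the previous match ended.
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("java", "Java, JAVA, and java")); // prints 3
    }
}
```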
You can obtain a string containing a matching sequence by calling group( ). The form used by Search Crawler is shown here:
String group(int which)
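Here, which specifies the number of the capturing group (a parenthesized part of the pattern) whose matched text you want, with the first group being 1. Group 0 refers to the entire match. The matching string is returned.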
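To illustrate how find( ) and group( ) work together, the sketch below collects the quoted value of each href attribute in a fragment of HTML. The pattern is a simplified, hypothetical one written for this example and should not be taken as the expression used by retrieveLinks( ). It relies on \s (a whitespace character), a construct not covered in this overview, and the names GroupSketch and extractLinks( ) are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Hypothetical, simplified pattern: the parentheses capture the text between the quotes.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the captured link text from every href attribute found in html.
    static List<String> extractLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // group(1) is the first parenthesized group; group(0) would be the whole match.
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <A HREF = \"b.html\">B</A>";
        System.out.println(extractLinks(html)); // prints [a.html, b.html]
    }
}
```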
Regular Expression Syntax
The syntax and rules that define a regular expression are similar to those used by Perl 5. Although no single rule is complicated, there are a large number of them, and a complete discussion is beyond the scope of this book. However, a few of the more commonly used constructs are described here.
In general, a regular expression is composed of normal characters, character classes (sets of characters), wildcard characters, and quantifiers. A normal character is matched as is. Thus, if a pattern consists of "xy", the only input sequence that will match it is "xy". Characters such as newlines and tabs are specified using the standard escape sequences, which begin with a backslash (\). For example, a newline is specified by \n. In the language of regular expressions, a normal character is also called a literal.
A character class is a set of characters. A character class is specified by putting the characters in the class between brackets. For example, the class [wxyz] matches w, x, y, or z. To specify an inverted set, precede the characters with a circumflex (^). For example, [^wxyz] matches any character except w, x, y, or z. You can specify a range of characters using a hyphen. For example, to specify a character class that will match the digits 1 through 9, use [1-9].
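The three forms of character class can be checked with a short sketch. It uses the static convenience method Pattern.matches( ), which is not discussed in this overview and which succeeds only when the entire input sequence matches the pattern. The class name CharClassSketch and the helper matchesWhole( ) are illustrative assumptions.

```java
import java.util.regex.Pattern;

class CharClassSketch {
    // Hypothetical helper: true only if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A simple set: any one of w, x, y, or z.
        System.out.println(matchesWhole("[wxyz]", "y"));  // true
        System.out.println(matchesWhole("[wxyz]", "a"));  // false

        // An inverted set: any one character except w, x, y, or z.
        System.out.println(matchesWhole("[^wxyz]", "a")); // true
        System.out.println(matchesWhole("[^wxyz]", "y")); // false

        // A range: any one digit from 1 through 9.
        System.out.println(matchesWhole("[1-9]", "5"));   // true
        System.out.println(matchesWhole("[1-9]", "0"));   // false
    }
}
```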
The wildcard character is the dot (.), and by default it matches any single character except a line terminator such as a newline. Thus, a pattern that consists of "." will match these (and other) input sequences: "A", "a", "x", and so on.
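A small sketch of the wildcard follows. One detail worth knowing is that, by default, the dot in Java does not match a line terminator such as a newline; compiling with the Pattern.DOTALL option lifts that restriction. The class WildcardSketch and the helper matchesWhole( ) are hypothetical names, and matches( ) on Matcher (which tests the entire input sequence) is used here even though it is not covered in this overview.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Hypothetical helper: true only if the entire input matches the pattern under the given options.
    static boolean matchesWhole(String regex, int options, String input) {
        return Pattern.compile(regex, options).matcher(input).matches();
    }

    public static void main(String[] args) {
        // With no options (0), the dot matches any single character except a line terminator.
        System.out.println(matchesWhole(".", 0, "A"));   // true
        System.out.println(matchesWhole(".", 0, "x"));   // true
        System.out.println(matchesWhole(".", 0, "\n"));  // false

        // With DOTALL, the dot matches line terminators too.
        System.out.println(matchesWhole(".", Pattern.DOTALL, "\n")); // true
    }
}
```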
A quantifier determines how many times an expression is matched. The quantifiers are shown here:
+ Match one or more.
* Match zero or more.
? Match zero or one.
For example, the pattern "x+" will match "x", "xx", and "xxx", among others.
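The three quantifiers can be compared side by side in a brief sketch. As an assumption for illustration, it uses the static Pattern.matches( ) method, which requires the whole input sequence to match; the class QuantifierSketch and the helper matchesWhole( ) are hypothetical names.

```java
import java.util.regex.Pattern;

class QuantifierSketch {
    // Hypothetical helper: true only if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // + requires at least one x.
        System.out.println(matchesWhole("x+", "xxx")); // true
        System.out.println(matchesWhole("x+", ""));    // false

        // * also accepts the empty sequence.
        System.out.println(matchesWhole("x*", "xxx")); // true
        System.out.println(matchesWhole("x*", ""));    // true

        // ? accepts at most one x.
        System.out.println(matchesWhole("x?", "x"));   // true
        System.out.println(matchesWhole("x?", "xx"));  // false
    }
}
```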