Crawling the Web with Java - The downloadPage(), removeWwwFromURL(), and
The downloadPage( ) Method
The downloadPage( ) method, shown here, does as its name implies: it downloads the Web page at the given URL and returns the contents of the page as one large string:
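// Download page at given URL.
// Requires java.io.BufferedReader, java.io.InputStreamReader, and java.net.URL.
private String downloadPage(URL pageUrl) {
    // Open connection to URL for reading. The try-with-resources form
    // closes the reader (and its underlying stream) even if reading fails.
    try (BufferedReader reader =
            new BufferedReader(new InputStreamReader(
                pageUrl.openStream()))) {
        // Read page into buffer.
        String line;
        StringBuffer pageBuffer = new StringBuffer();
        while ((line = reader.readLine()) != null) {
            pageBuffer.append(line);
        }
        return pageBuffer.toString();
    } catch (Exception e) {
        // Intentionally empty: fall through to return null.
    }
    return null;
}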
Downloading Web pages from the Internet in Java is quite simple, as evidenced by this method. First, a BufferedReader object is created for reading the contents of the page at the given URL. The BufferedReader’s constructor is passed an instance of InputStreamReader, whose constructor is passed the InputStream object returned from calling pageUrl.openStream( ). Next, a while loop is used to read the contents of the page, line by line, until the reader.readLine( ) method returns null, signaling that all lines have been read. Each line that is read with the while loop is added to the pageBuffer StringBuffer instance. After the page has been downloaded, its contents are returned as a String by calling pageBuffer.toString( ).
If an error occurs when opening the input stream to the page URL or while reading the contents of the Web page, an exception will be thrown. This exception will be caught by the empty catch block. The catch block has purposefully been left blank so that execution will continue to the remaining return null line. A return value of null from this method indicates to callers that an error occurred.
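As a point of comparison, here is a minimal sketch of a variant of this method. It is not the book's code. It assumes Java 7 or later, assumes the page is encoded as UTF-8 (a real crawler would need to detect the encoding), and re-adds a newline after each line, because readLine( ) strips line terminators. The class name PageDownloaderSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; a sketch, not the Search Crawler's implementation.
class PageDownloaderSketch {

    // Returns the page contents, or null if the page could not be read
    // (the same contract as the downloadPage() method described above).
    static String downloadPage(URL pageUrl) {
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(pageUrl.openStream(),
                                      StandardCharsets.UTF_8))) {
            StringBuilder pageBuffer = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() drops the line terminator, so put one back.
                pageBuffer.append(line).append('\n');
            }
            return pageBuffer.toString();
        } catch (IOException e) {
            // Opening or reading the stream failed; signal this with null.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PageDownloaderSketch <url>");
            return;
        }
        String page = downloadPage(URI.create(args[0]).toURL());
        System.out.println(page == null
            ? "Download failed"
            : page.length() + " characters read");
    }
}
```

Because callers receive null on failure, they must check the return value before using it, as main( ) does here.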
The removeWwwFromUrl( ) Method
The removeWwwFromUrl( ) method is a simple utility method used to remove the “www” portion of a URL’s host. For example, take the URL:
http://www.osborne.com
This method removes the “www.” piece of the URL, yielding:
http://osborne.com
Because many Web sites intermingle URLs that do and don’t start with “www”, the Search Crawler uses this technique to find the “lowest common denominator” URL. Effectively, both URLs are the same on most Web sites, and having the lowest common denominator allows the Search Crawler to skip over duplicate URLs that would otherwise be redundantly crawled.
The removeWwwFromUrl( ) method is shown here:
// Remove leading "www" from a URL's host if present.
private String removeWwwFromUrl(String url) {
    int index = url.indexOf("://www.");
    if (index != -1) {
        return url.substring(0, index + 3) +
            url.substring(index + 7);
    }
    return url;
}
The removeWwwFromUrl( ) method starts out by finding the index of "://www." inside the string passed to url. The "://" at the beginning of the string passed to the indexOf( ) method indicates that "www" should be found at the beginning of a URL where the protocol is defined (for example, http://www.osborne.com). This way, URLs that simply contain the string "www" are not tampered with. If url contains "://www.", the characters before and after "www." are concatenated and returned. Otherwise, the string passed to url is returned.
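To see the index arithmetic at work, the following small sketch wraps a copy of the method in a hypothetical class, RemoveWwwDemo, and runs it against a few sample URLs. The example.com URL is made up for illustration.

```java
// Hypothetical demo class; the method body is copied from the listing above.
class RemoveWwwDemo {

    static String removeWwwFromUrl(String url) {
        int index = url.indexOf("://www.");
        if (index != -1) {
            // Keep everything through "://" (3 characters), then skip
            // "www." by resuming 7 characters past the match start.
            return url.substring(0, index + 3) +
                url.substring(index + 7);
        }
        return url;
    }

    public static void main(String[] args) {
        // prints http://osborne.com
        System.out.println(removeWwwFromUrl("http://www.osborne.com"));
        // prints https://example.com/path
        System.out.println(removeWwwFromUrl("https://www.example.com/path"));
        // "www" is not directly after "://", so the URL is unchanged:
        // prints http://osborne.com/www.html
        System.out.println(removeWwwFromUrl("http://osborne.com/www.html"));
    }
}
```

Note that indexOf( ) returns the first occurrence of "://www." anywhere in the string, so a URL that embeds another URL (for example, in a query string) could have the embedded "www." removed instead.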
The retrieveLinks( ) Method
The retrieveLinks( ) method parses through the contents of a Web page and retrieves all the relevant links. The Web page for which links are being retrieved is stored in a large String object. To say the least, parsing through this string, looking for specific character sequences, would be quite cumbersome using the methods defined by the String class. Fortunately, beginning with Java 2, v1.4, Java comes standard with a regular expression API library that makes easy work of parsing through strings.
The regular expression API is contained in java.util.regex. The topic of regular expressions is fairly large, and a complete discussion is beyond the scope of this book. However, because regular expression processing is key to Search Crawler, a brief overview is presented here.
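To give a feel for the API before that overview, here is a minimal sketch that uses java.util.regex to pull href values out of a string of HTML. It is not the retrieveLinks( ) method. The class name LinkRegexSketch, the method extractHrefs( ), and the pattern itself are assumptions made for illustration, and the pattern handles only double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; illustrates Pattern and Matcher only.
class LinkRegexSketch {

    // Assumed pattern: an <a ...> tag whose href value is in double quotes.
    // Group 1 captures the text between the quotes.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s+[^>]*?href\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        // find() advances to the next match each time it is called.
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://osborne.com\">Home</a> "
            + "<A HREF = \"/about.html\">About</A>";
        // prints [http://osborne.com, /about.html]
        System.out.println(extractHrefs(html));
    }
}
```

The pattern is compiled once and reused, since compiling a regular expression is more expensive than matching against it.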
This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388).