
Crawling the Web with Java


Are you playing with the possibilities of Java? This article explores in detail how to build a Web crawler in Java, walking through the SearchCrawler class and its methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
TABLE OF CONTENTS:
  1. · Crawling the Web with Java
  2. · Fundamentals of a Web Crawler
  3. · An Overview of the Search Crawler
  4. · The SearchCrawler Class part 1
  5. · The SearchCrawler Class part 2
  6. · SearchCrawler Variables and Constructor
  7. · The search() Method
  8. · The showError() and updateStats() Methods
  9. · The addMatch() and verifyURL() Methods
  10. · The downloadPage(), removeWwwFromURL(), and retrieveLinks() Methods
  11. · An Overview of Regular Expression Processing
  12. · A Close Look at retrieveLinks()
  13. · The searchStringMatches() Method
  14. · The crawl() Method
  15. · Compiling and Running the Search Web Crawler


The downloadPage( ) Method

The downloadPage( ) method, shown here, simply does as its name implies: it downloads the Web page at the given URL and returns the contents of the page as a large string:

// Download page at given URL.
private String downloadPage(URL pageUrl) {
  try {
    // Open connection to URL for reading.
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(pageUrl.openStream()));

    // Read page into buffer.
    String line;
    StringBuffer pageBuffer = new StringBuffer();
    while ((line = reader.readLine()) != null) {
      pageBuffer.append(line);
    }

    return pageBuffer.toString();
  } catch (Exception e) {
    // Deliberately empty; the null return below signals the error.
  }

  return null;
}

Downloading Web pages from the Internet in Java is quite simple, as evidenced by this method. First, a BufferedReader object is created for reading the contents of the page at the given URL. The BufferedReader’s constructor is passed an instance of InputStreamReader, whose constructor is passed the InputStream object returned from calling pageUrl.openStream( ). Next, a while loop is used to read the contents of the page, line by line, until the reader.readLine( ) method returns null, signaling that all lines have been read. Each line that is read with the while loop is added to the pageBuffer StringBuffer instance. After the page has been downloaded, its contents are returned as a String by calling pageBuffer.toString( ).

If an error occurs while opening the input stream to the page URL or while reading the contents of the Web page, an exception will be thrown. This exception is caught by the empty catch block, which has purposefully been left blank so that execution continues on to the return null statement that follows the try-catch block. A return value of null from this method indicates to callers that an error occurred.
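To make that convention concrete, here is a hypothetical fragment (the verifiedUrl variable and the surrounding logic are illustrative, not taken verbatim from the book's code) showing how a caller might react to downloadPage( )'s return value:

// Hypothetical caller: download a page and skip it if the download failed.
String pageContents = downloadPage(verifiedUrl);
if (pageContents != null && pageContents.length() > 0) {
  // ... search the page contents and harvest its links ...
} else {
  // A null (or empty) result signals an error; move on to the next URL.
}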

The removeWwwFromUrl( ) Method

The removeWwwFromUrl( ) method is a simple utility method used to remove the “www” portion of a URL’s host. For example, take the URL:

  http://www.osborne.com

This method removes the “www.” piece of the URL, yielding:

  http://osborne.com

Because many Web sites intermingle URLs that do and don’t start with “www”, the Search Crawler uses this technique to find the “lowest common denominator” URL. Effectively, both URLs are the same on most Web sites, and having the lowest common denominator allows the Search Crawler to skip over duplicate URLs that would otherwise be redundantly crawled.
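As a hypothetical sketch of how this normalization pays off (the crawledList set shown here is an assumption about the surrounding crawl logic, which is covered later in this chapter), duplicate detection reduces to a simple set lookup on the normalized URL:

// Hypothetical sketch: normalize a URL before consulting the set of
// pages already crawled, so that http://www.osborne.com and
// http://osborne.com are treated as the same page.
HashSet crawledList = new HashSet();

String normalized = removeWwwFromUrl(url.toString());
if (!crawledList.contains(normalized)) {
  crawledList.add(normalized);
  // ... download and process the page ...
}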

The removeWwwFromUrl( ) method is shown here:

// Remove leading "www" from a URL's host if present.
private String removeWwwFromUrl(String url) {
  int index = url.indexOf("://www.");

  if (index != -1) {
    return url.substring(0, index + 3) +
        url.substring(index + 7);
  }

  return (url);
}

The removeWwwFromUrl( ) method starts out by finding the index of "://www." inside the string passed to url. The "://" at the beginning of the string passed to the indexOf( ) method indicates that "www" should be found at the beginning of a URL where the protocol is defined (for example, http://www.osborne.com). This way, URLs that simply contain the string "www" are not tampered with. If url contains "://www.", the characters before and after "www." are concatenated and returned. Otherwise, the string passed to url is returned.
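The following hypothetical snippet (not from the book) demonstrates the method's behavior on a couple of sample inputs:

// Hypothetical demonstration of removeWwwFromUrl( ).
System.out.println(removeWwwFromUrl("http://www.osborne.com"));
// prints: http://osborne.com

System.out.println(removeWwwFromUrl("http://osborne.com/www/page.html"));
// prints: http://osborne.com/www/page.html (unchanged, because "www"
// does not immediately follow "://")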

The retrieveLinks( ) Method

The retrieveLinks( ) method parses through the contents of a Web page and retrieves all the relevant links. The Web page for which links are being retrieved is stored in a large String object. To say the least, parsing through this string, looking for specific character sequences, would be quite cumbersome using the methods defined by the String class. Fortunately, beginning with Java 2, v1.4, Java comes standard with a regular expression API library that makes easy work of parsing through strings.

The regular expression API is contained in java.util.regex. The topic of regular expressions is fairly large, and a complete discussion is beyond the scope of this book. However, because parsing regular expressions is key to Search Crawler, a brief overview is presented here.
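To give a flavor of the API before the overview begins, the standalone example below uses java.util.regex to pull href values out of a fragment of HTML. The pattern shown is a deliberately simplified stand-in, not the pattern retrieveLinks( ) actually uses, which is examined later in this chapter:

import java.util.regex.*;

public class LinkExtractorDemo {
  public static void main(String[] args) {
    String html =
      "<a href=\"http://www.osborne.com\">Osborne</a> " +
      "<a href=\"page2.html\">Next</a>";

    // Simplified pattern: match <a href="...">, capturing the quoted URL.
    Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"(.*?)\"",
                                Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(html);

    while (m.find()) {
      System.out.println(m.group(1)); // prints each captured URL in turn
    }
  }
}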

