
Crawling the Web with Java


Are you playing with the possibilities of Java? This article explores in detail how to build and use a Web crawler class in Java, along with its supporting methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
TABLE OF CONTENTS:
  1. · Crawling the Web with Java
  2. · Fundamentals of a Web Crawler
  3. · An Overview of the Search Crawler
  4. · The SearchCrawler Class part 1
  5. · The SearchCrawler Class part 2
  6. · SearchCrawler Variables and Constructor
  7. · The search() Method
  8. · The showError() and updateStats() Methods
  9. · The addMatch() and verifyUrl() Methods
  10. · The downloadPage(), removeWwwFromURL(), and
  11. · An Overview of Regular Expression Processing
  12. · A Close Look at retrieveLinks()
  13. · The searchStringMatches() Method
  14. · The crawl() Method
  15. · Compiling and Running the Search Web Crawler


Crawling the Web with Java - The addMatch() and verifyUrl() Methods
(Page 9 of 15)

The addMatch( ) Method

The addMatch( ) method is called by the crawl( ) method each time a match with the search string is found. The addMatch( ) method, shown here, adds a URL to both the matches table and the log file:

// Add match to matches table and log file.
private void addMatch(String url) {
  // Add URL to matches table.
  DefaultTableModel model =
    (DefaultTableModel) table.getModel();
  model.addRow(new Object[]{url});
  // Add URL to matches log file.
  try {
    logFileWriter.println(url);
  } catch (Exception e) {
    showError("Unable to log match.");
  }
}

This method first adds the URL to the matches table by retrieving the table’s data model and calling its addRow( ) method. Notice that addRow( ) takes an Object array as input, so the url String object is wrapped in an Object array to satisfy that requirement. After adding the URL to the matches table, the URL is written to the log file with a call to logFileWriter.println( ). This call is wrapped in a try-catch block; if an exception is thrown, the showError( ) method is called to alert the user that an error occurred while trying to write to the log file.
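
The table and logFileWriter fields used here are created elsewhere in SearchCrawler. The standalone sketch below shows, in rough form, the kind of setup addMatch( ) depends on; the table column name and log file name are assumptions made for the sake of the example, not the book’s exact code:

// Minimal sketch of the pieces addMatch() relies on; not SearchCrawler's
// actual initialization code. The log file name is an assumption.
import java.io.FileWriter;
import java.io.PrintWriter;
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;

public class MatchLogSketch {
  public static void main(String[] args) throws Exception {
    // Single-column table model that holds matched URLs.
    DefaultTableModel model =
      new DefaultTableModel(new Object[][]{}, new Object[]{"URL"});
    JTable table = new JTable(model);

    // Log writer with auto-flush so each match is written immediately.
    PrintWriter logFileWriter =
      new PrintWriter(new FileWriter("crawler.log"), true);

    // Mirror what addMatch() does for one matching URL.
    String url = "http://www.osborne.com";
    ((DefaultTableModel) table.getModel()).addRow(new Object[]{url});
    logFileWriter.println(url);
    logFileWriter.close();

    System.out.println("Rows in matches table: " + model.getRowCount());
  }
}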

The verifyUrl( ) Method

The verifyUrl( ) method, shown here, is used throughout SearchCrawler to verify the format of a URL. Additionally, this method serves to convert a string representation of a URL into a URL object:

// Verify URL format.
private URL verifyUrl(String url) {
  // Only allow HTTP URLs.
  if (!url.toLowerCase().startsWith("http://"))
    return null;

  // Verify format of URL.
  URL verifiedUrl = null;
  try {
    verifiedUrl = new URL(url);
  } catch (Exception e) {
    return null;
  }

  return verifiedUrl;
}

This method first verifies that the given URL is an HTTP URL, since only HTTP URLs are supported by Search Crawler. Next, the URL being verified is used to construct a new URL object. If the URL is malformed, the URL class constructor will throw an exception, resulting in null being returned from this method. A null return value denotes that the string passed in url is not a valid, verified URL.
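
To see the effect of these rules, the logic of verifyUrl( ) can be exercised on its own. The small demo below is ours, not part of SearchCrawler:

// Quick test of the verifyUrl() logic shown above; the wrapper class
// is ours, not part of SearchCrawler.
import java.net.URL;

public class VerifyUrlDemo {
  // Same logic as verifyUrl(), reproduced so the demo is self-contained.
  private static URL verifyUrl(String url) {
    if (!url.toLowerCase().startsWith("http://"))
      return null;
    try {
      return new URL(url);
    } catch (Exception e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // Prints the URL object for a well-formed HTTP URL.
    System.out.println(verifyUrl("http://www.osborne.com"));
    // Prints null because only HTTP URLs are accepted.
    System.out.println(verifyUrl("ftp://ftp.example.com"));
  }
}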

The isRobotAllowed() Method

The isRobotAllowed( ) method fulfills the robot protocol. In order to fully explain this method, we’ll review it line by line.

The isRobotAllowed( ) method starts with these lines of code:

String host = urlToCheck.getHost().toLowerCase();

// Retrieve host's disallow list from cache.
ArrayList disallowList =
  (ArrayList) disallowListCache.get(host);

// If list is not in the cache, download and cache it.
if (disallowList == null) {
  disallowList = new ArrayList();

In order to efficiently check whether or not robots are allowed to access a URL, Search Crawler caches each host’s disallow list after it has been retrieved. This significantly improves the performance of Search Crawler because it avoids downloading the disallow list for each URL being verified. Instead, it just retrieves the list from cache.

The disallow list cache is keyed on the host portion of a URL, so isRobotAllowed( ) starts out by retrieving the urlToCheck’s host by calling its getHost( ) method. Notice that toLowerCase( ) is tacked on to the end of the getHost( ) call. Lowercasing the host ensures that duplicate host entries are not placed in the cache: the host portion of a URL is case insensitive on the Internet, but the cache keys are case-sensitive strings. After retrieving the urlToCheck’s host, an attempt is made to retrieve the host’s disallow list from the cache. If the list is not already in the cache, null is returned, signaling that the disallow list must be downloaded from the host. The process of retrieving the disallow list from a host starts by creating a new ArrayList object.
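
The following standalone sketch shows the cache lookup pattern in isolation. The HashMap-based declaration is an assumption made for the example; SearchCrawler declares its disallowListCache field elsewhere:

// Sketch of the host-keyed disallow list cache; the HashMap declaration
// here is an assumption, since SearchCrawler declares its cache elsewhere.
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;

public class DisallowCacheSketch {
  private static HashMap disallowListCache = new HashMap();

  public static void main(String[] args) throws Exception {
    URL urlToCheck = new URL("http://WWW.Osborne.com/books/index.html");

    // Lowercase the host so "WWW.Osborne.com" and "www.osborne.com"
    // share a single cache entry.
    String host = urlToCheck.getHost().toLowerCase();
    ArrayList disallowList = (ArrayList) disallowListCache.get(host);
    if (disallowList == null) {
      // In SearchCrawler, this is where robots.txt would be downloaded.
      disallowList = new ArrayList();
      disallowListCache.put(host, disallowList);
    }
    System.out.println("Disallow list cached for " + host);
  }
}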

Next, the contents of the disallow list are populated, beginning with the following lines of code:

try {
  URL robotsFileUrl =
    new URL("http://" + host + "/robots.txt");

  // Open connection to robot file URL for reading.
  BufferedReader reader =
    new BufferedReader(new InputStreamReader(
      robotsFileUrl.openStream()));

As mentioned earlier, Web site owners wishing to prevent Web crawlers from crawling their site, or portions of their site, must have a file called robots.txt at the root of their Web site hierarchy. The host portion of the urlToCheck is used to construct a URL, robotsFileUrl, pointing to the robots.txt file. Then a BufferedReader object is created for reading the contents of the robots.txt file. The BufferedReader’s constructor is passed an instance of InputStreamReader, whose constructor is passed the InputStream object returned from calling robotsFileUrl.openStream( ).
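
If you want to experiment, the same fetch can be tried on its own, outside SearchCrawler. The host name in this sketch is only an example:

// Standalone sketch: open a reader on a host's robots.txt file.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsFileSketch {
  public static void main(String[] args) throws Exception {
    String host = "www.osborne.com";   // example host, not from the book
    URL robotsFileUrl = new URL("http://" + host + "/robots.txt");

    BufferedReader reader =
      new BufferedReader(new InputStreamReader(
        robotsFileUrl.openStream()));

    // Print the first line just to show the connection works.
    System.out.println(reader.readLine());
    reader.close();
  }
}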

The following sequence sets up a while loop for reading the contents of the robots.txt file:

// Read robot file, creating list of disallowed paths.
String line;
while ((line = reader.readLine()) != null) {
  if (line.indexOf("Disallow:") == 0) {
    String disallowPath =
      line.substring("Disallow:".length());

The loop reads the contents of the file, line by line, until the reader.readLine( ) method returns null, signaling that all lines have been read. Each line that is read is checked to see if it has a Disallow statement by using the indexOf( ) method defined by String. If the line does in fact have a Disallow statement, the disallow path is culled from the line by taking a substring of the line from the point where the string "Disallow:" ends.

As discussed, comments can be interspersed in the robots.txt file by using a hash (#) character followed by a comment. Since comments will interfere with the Disallow statement comparisons, they are removed in the following lines of code:

    // Check disallow path for comments and remove if present.
    int commentIndex = disallowPath.indexOf("#");
    if (commentIndex != -1) {
      disallowPath =
        disallowPath.substring(0, commentIndex);
    }

    // Remove leading or trailing spaces from disallow path.
    disallowPath = disallowPath.trim();

    // Add disallow path to list.
    disallowList.add(disallowPath);
  }

First, the disallow path is searched to see if it contains a hash character. If it does, the path is truncated at that character, removing the comment from the end of the string. After checking for and potentially removing a comment, disallowPath.trim( ) is called to remove any leading or trailing space characters. Like comments, extraneous space characters would trip up the comparisons, so they are removed. Finally, the disallow path is added to the list of disallow paths.
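
Putting these parsing steps together, a single robots.txt line could be processed as in the following sketch; the sample line is invented for illustration:

// Sketch: pull a clean disallow path out of one robots.txt line.
// The sample line is made up for illustration.
public class DisallowParseSketch {
  public static void main(String[] args) {
    String line = "Disallow: /cgi-bin/  # keep crawlers out of scripts";

    if (line.indexOf("Disallow:") == 0) {
      String disallowPath = line.substring("Disallow:".length());

      // Strip a trailing comment, if one is present.
      int commentIndex = disallowPath.indexOf("#");
      if (commentIndex != -1) {
        disallowPath = disallowPath.substring(0, commentIndex);
      }

      // Remove leading and trailing spaces.
      disallowPath = disallowPath.trim();

      // Prints: /cgi-bin/
      System.out.println(disallowPath);
    }
  }
}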

After the disallow path list has been created, it is added to the disallow list cache, as shown here:

    // Add new disallow list to cache.
    disallowListCache.put(host, disallowList);
  }
  catch (Exception e) {
    /* Assume robot is allowed since an exception
       is thrown if the robot file doesn't exist. */
    return true;
  }
}

The disallow list is added to the disallow list cache so that, on subsequent requests, the list can be retrieved quickly from the cache instead of being downloaded again.

If an error occurs while opening the input stream to the robot file URL or while reading the contents of the file, an exception will be thrown. Because an exception is also thrown when the robots.txt file simply does not exist, we’ll assume that robots are allowed whenever an exception occurs. Normally, the error checking in this scenario should be more robust; however, for simplicity and brevity’s sake, we make the blanket decision that robots are allowed.

Next, the following code iterates over the disallow list to see if the urlToCheck is allowed or not:

/* Loop through disallow list to see if
   crawling is allowed for the given URL. */
String file = urlToCheck.getFile();
for (int i = 0; i < disallowList.size(); i++) {
  String disallow = (String) disallowList.get(i);
  if (file.startsWith(disallow)) {
    return false;
  }
}
return true;

Each iteration of the for loop checks to see if the file portion of the urlToCheck is found in the disallow list. If the urlToCheck’s file does in fact match one of the statements in the disallow list, then false is returned, indicating that crawlers are not allowed to crawl the given URL. However, if the list is iterated over and no match is made, true is returned, indicating that crawling is allowed.
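
The following standalone sketch illustrates that comparison with an invented URL and disallow list:

// Sketch of the final comparison; the URL and disallow entries are made up.
import java.net.URL;
import java.util.ArrayList;

public class DisallowCheckSketch {
  public static void main(String[] args) throws Exception {
    ArrayList disallowList = new ArrayList();
    disallowList.add("/cgi-bin/");
    disallowList.add("/private");

    URL urlToCheck =
      new URL("http://www.example.com/private/report.html");

    // getFile() returns the path (plus query string, if any).
    String file = urlToCheck.getFile();

    boolean allowed = true;
    for (int i = 0; i < disallowList.size(); i++) {
      String disallow = (String) disallowList.get(i);
      if (file.startsWith(disallow)) {
        allowed = false;   // "/private/report.html" starts with "/private"
        break;
      }
    }
    System.out.println("Crawling allowed: " + allowed);   // false
  }
}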

