Crawling the Web with Java
By: McGraw-Hill/Osborne
    2005-06-09

    Table of Contents:
  • Crawling the Web with Java
  • Fundamentals of a Web Crawler
  • An Overview of the Search Crawler
  • The SearchCrawler Class part 1
  • The SearchCrawler Class part 2
  • SearchCrawler Variables and Constructor
  • The search() Method
  • The showError() and updateStats() Methods
  • The addMatch() and verifyURL() Methods
  • The downloadPage(), removeWwwFromURL(), and
  • An Overview of Regular Expression Processing
  • A Close Look at retrieveLinks()
  • The searchStringMatches() Method
  • The crawl() Method
  • Compiling and Running the Search Web Crawler



    Crawling the Web with Java - The crawl() Method


    (Page 14 of 15)

    The crawl( ) method is the core of the search Web crawler because it performs the actual crawling. It begins with these lines of code:

    // Set up crawl lists.
    HashSet crawledList = new HashSet();
    LinkedHashSet toCrawlList = new LinkedHashSet();
    // Add start URL to the To Crawl list.
    toCrawlList.add(startUrl);

    There are several techniques that can be employed to crawl Web sites, recursion being a natural choice because crawling itself is recursive. Recursion, however, can be quite resource intensive, so the Search Crawler uses a queue technique. Here, toCrawlList is initialized to hold the queue of links to crawl. The start URL is then added to toCrawlList to begin the crawling process.
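
The queue technique can be illustrated without any networking. The following is a minimal sketch, not the Search Crawler's own code: it assumes a hypothetical in-memory link graph (a Map from page name to outgoing links) in place of real downloads, uses generics rather than the raw collection types shown in this article, and relies on List.of and Map.of, which require Java 9 or later. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class QueueCrawlSketch {
    /** Visits every page reachable from start, oldest pending page first, without recursion. */
    static List<String> crawlOrder(Map<String, List<String>> linkGraph, String start) {
        Set<String> crawled = new HashSet<>();
        LinkedHashSet<String> toCrawl = new LinkedHashSet<>();
        List<String> order = new ArrayList<>();
        toCrawl.add(start);
        while (!toCrawl.isEmpty()) {
            // Take the oldest pending page (first in insertion order).
            String url = toCrawl.iterator().next();
            toCrawl.remove(url);
            crawled.add(url);
            order.add(url);
            // Queue links that have not been crawled; the set ignores duplicates.
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (!crawled.contains(link)) {
                    toCrawl.add(link);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical link graph standing in for downloaded pages.
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "a"),
            "c", List.of("d"));
        System.out.println(crawlOrder(graph, "a")); // prints [a, b, c, d]
    }
}
```

Because pending work lives in a collection on the heap rather than in nested method calls, the depth of the link structure does not affect the call stack.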

    After initializing the To Crawl list and adding the start URL, crawling begins with a while loop set up to run until the crawling flag is turned off or until the To Crawl list has been exhausted, as shown here:

    /* Perform actual crawling by looping
       through the To Crawl list. */
    while (crawling && toCrawlList.size() > 0)
    {
      /* Check to see if the max URL count has
         been reached, if it was specified. */
      if (maxUrls != -1) {
        if (crawledList.size() == maxUrls) {
          break;
        }
      }

    Remember that the crawling flag is used to stop crawling prematurely. If the Stop button on the interface is clicked during crawling, crawling is set to false. The next time the while loop’s expression is evaluated, the loop will end because the crawling flag is false. The first section of code inside the while loop checks to see if the crawling limit specified by maxUrls has been reached. This check is performed only if the maxUrls variable has been set, as indicated by a value other than -1.
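
The two exit conditions can be tried in isolation. This is a rough sketch under stated assumptions: the class and method names are hypothetical, no pages are downloaded, and the flag is declared volatile here on the assumption that it is written from a different thread (such as a button handler) than the one running the loop. Whether the original field is declared that way is not shown on this page.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CrawlLimitSketch {
    // Assumption: volatile, so a write from another thread is visible to the loop.
    static volatile boolean crawling = true;

    /** Processes pending URLs until stopped, exhausted, or maxUrls is reached; -1 means no limit. */
    static int crawlCount(List<String> pending, int maxUrls) {
        Set<String> crawled = new HashSet<>();
        LinkedHashSet<String> toCrawl = new LinkedHashSet<>(pending);
        while (crawling && !toCrawl.isEmpty()) {
            // Stop once the limit is hit, but only if a limit was specified.
            if (maxUrls != -1 && crawled.size() == maxUrls) {
                break;
            }
            String url = toCrawl.iterator().next();
            toCrawl.remove(url);
            crawled.add(url);
        }
        return crawled.size();
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a", "b", "c");
        System.out.println(crawlCount(urls, 2));  // prints 2
        System.out.println(crawlCount(urls, -1)); // prints 3
    }
}
```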

    Upon each iteration of the while loop, the following code is executed:

    // Get URL at bottom of the list.
    String url = (String) toCrawlList.iterator().next();
    // Remove URL from the To Crawl list.
    toCrawlList.remove(url);
    // Convert string url to URL object.
    URL verifiedUrl = verifyUrl(url);
    // Skip URL if robots are not allowed to access it.
    if (!isRobotAllowed(verifiedUrl)) {
      continue;
    }

    First, the URL at the bottom of the To Crawl list, that is, the oldest entry still in the list, is “popped” off. Thus, the list works in a first in, first out (FIFO) manner. Since the URLs are stored in a LinkedHashSet object, there is not actually a “pop” method. Instead, the functionality of a pop method is simulated by first retrieving the oldest value with a call to toCrawlList.iterator( ).next( ); a LinkedHashSet iterates in insertion order, so the first element returned is the one that was added earliest. Then the URL retrieved from the list is removed from the list by calling toCrawlList.remove( ), passing in the URL as an argument.
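
The simulated pop can be wrapped in a small helper. The sketch below is an illustration rather than code from the Search Crawler: popFirst is a hypothetical name, and it uses Iterator.remove( ) instead of a separate remove( ) call on the set, which has the same effect here. The example URLs are placeholders.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

class PopSketch {
    /** Removes and returns the oldest element; the set must not be empty. */
    static String popFirst(LinkedHashSet<String> set) {
        Iterator<String> it = set.iterator();
        String first = it.next(); // earliest-inserted element
        it.remove();              // removes the element just returned
        return first;
    }

    public static void main(String[] args) {
        LinkedHashSet<String> toCrawl = new LinkedHashSet<>();
        toCrawl.add("http://example.com/a");
        toCrawl.add("http://example.com/b");
        toCrawl.add("http://example.com/a"); // duplicate, ignored by the set
        System.out.println(popFirst(toCrawl)); // prints http://example.com/a
        System.out.println(toCrawl);           // prints [http://example.com/b]
    }
}
```

A LinkedHashSet gives queue-like ordering and duplicate suppression in one structure, which is why a URL added twice is still crawled only once.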

    After retrieving the next URL from the To Crawl list, the string representation of the URL is converted to a URL object using the verifyUrl( ) method. Next, the URL is checked to see whether or not it is allowed to be crawled by calling the isRobotAllowed( ) method. If the crawler is not allowed to crawl the given URL, then continue is executed to skip to the next iteration of the while loop.

    After retrieving and verifying the next URL on the crawl list, the results are updated in the Stats section, as shown here:

    // Update crawling stats.
    updateStats(url, crawledList.size(), toCrawlList.size(),
      maxUrls);
    // Add page to the crawled list.
    crawledList.add(url);
    // Download the page at the given URL.
    String pageContents = downloadPage(verifiedUrl);

    The output is updated with a call to updateStats( ). The URL is then added to the crawled list, indicating that it has been crawled and that subsequent references to the URL should be skipped. Next, the page at the given URL is downloaded with a call to downloadPage( ).

    If the downloadPage( ) method successfully downloads the page at the given URL, the following code is executed:

    /* If the page was downloaded successfully, retrieve all of its
       links and then see if it contains the search string. */
    if (pageContents != null && pageContents.length() > 0)
    {
      // Retrieve list of valid links from page.
      ArrayList links =
        retrieveLinks(verifiedUrl, pageContents, crawledList,
          limitHost);
      // Add links to the To Crawl list.
      toCrawlList.addAll(links);
      /* Check if search string is present in
         page, and if so, record a match. */
      if (searchStringMatches(pageContents, searchString,
           caseSensitive))
      {
        addMatch(url);
      }
    }

    First, the page links are retrieved by calling the retrieveLinks( ) method. Each of the links returned from the retrieveLinks( ) call is then added to the To Crawl list. Next, the downloaded page is searched to see if the search string is found in the page with a call to searchStringMatches( ). If the search string is found in the page, the page is recorded as a match with the addMatch( ) method.
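
The two steps performed on each downloaded page can be sketched on their own. Both methods below are hypothetical stand-ins written for illustration: enqueue shows how addAll( ) on a LinkedHashSet silently drops links that are already queued, and matches is a plain substring test, which is an assumption and not necessarily how searchStringMatches( ) is implemented.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;

class PageStepSketch {
    /** Hypothetical stand-in for searchStringMatches(): a plain substring test. */
    static boolean matches(String page, String search, boolean caseSensitive) {
        if (!caseSensitive) {
            page = page.toLowerCase(Locale.ROOT);
            search = search.toLowerCase(Locale.ROOT);
        }
        return page.contains(search);
    }

    /** Queues links for crawling and returns how many were new to the queue. */
    static int enqueue(LinkedHashSet<String> toCrawl, List<String> links) {
        int before = toCrawl.size();
        toCrawl.addAll(links); // duplicates of queued links are ignored
        return toCrawl.size() - before;
    }

    public static void main(String[] args) {
        LinkedHashSet<String> toCrawl = new LinkedHashSet<>(List.of("x"));
        System.out.println(enqueue(toCrawl, List.of("x", "y", "y", "z"))); // prints 2
        System.out.println(matches("Hello World", "world", false));        // prints true
        System.out.println(matches("Hello World", "world", true));         // prints false
    }
}
```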

    The crawl( ) method finishes by calling updateStats( ) again at the end of the while loop:

      // Update crawling stats.
      updateStats(url, crawledList.size(), toCrawlList.size(),
        maxUrls);
    }

    The first call to updateStats( ), earlier in this method, updates the label that indicates which URL is being crawled. This second call updates all the other values because they will have changed since the first call.


    This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388).

    © 2003-2008 by Developer Shed. All rights reserved.