Crawling the Web with Java
By: McGraw-Hill/Osborne
    2005-06-09

    Table of Contents:
  • Crawling the Web with Java
  • Fundamentals of a Web Crawler
  • An Overview of the Search Crawler
  • The SearchCrawler Class part 1
  • The SearchCrawler Class part 2
  • SearchCrawler Variables and Constructor
  • The search() Method
  • The showError() and updateStats() Methods
  • The addMatch() and verifyURL() Methods
  • The downloadPage(), removeWwwFromURL(), and
  • An Overview of Regular Expression Processing
  • A Close Look at retrieveLinks()
  • The searchStringMatches() Method
  • The crawl() Method
  • Compiling and Running the Search Web Crawler

    Crawling the Web with Java - SearchCrawler Variables and Constructor


    (Page 6 of 15 )

    The SearchCrawler Variables

    SearchCrawler starts off by declaring several instance variables, most of which hold references to the interface controls. First, the MAX_URLS String array declares the list of values to be displayed in the Max URLs to Crawl combo box. Second, disallowListCache is defined for caching robot disallow lists so that they don’t have to be retrieved for each URL being crawled. Next, each of the interface controls is declared for the Search, Stats, and Matches sections of the interface. After the interface controls have been declared, the crawling flag is defined for tracking whether or not crawling is underway. Finally, the logFileWriter instance variable, which is used for printing search matches to a log file, is declared.

    The SearchCrawler Constructor

    When the SearchCrawler is instantiated, all the interface controls are initialized inside its constructor. The constructor contains a lot of code, but most of it is straightforward. The following discussion gives an overview.

    First, the application’s window title is set with a call to setTitle( ). Next, the setSize( ) call establishes the window’s width and height in pixels. After that, a window listener is added by calling addWindowListener( ) and passing it a WindowAdapter object that overrides the windowClosing( ) event handler. This handler calls the actionExit( ) method when the application’s window is closed. Next, a menu bar with a File menu is added to the application’s window.
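
    As a rough illustration of the window-listener step, the sketch below builds a WindowAdapter whose windowClosing( ) handler delegates to an exit action. The class name ExitOnCloseListener, the method closingHandler( ), and the use of a Runnable to stand in for actionExit( ) are assumptions made for this example and are not taken from the SearchCrawler source.

```java
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import java.awt.event.WindowListener;

// Hypothetical helper, not part of SearchCrawler.
class ExitOnCloseListener {
    // Returns a listener that runs the given exit action when the
    // window's close button is used. WindowAdapter supplies empty
    // implementations of every other WindowListener method.
    static WindowListener closingHandler(final Runnable actionExit) {
        return new WindowAdapter() {
            @Override
            public void windowClosing(WindowEvent e) {
                actionExit.run();
            }
        };
    }
}
```

    Inside a JFrame subclass, such a listener would be registered with a call along the lines of addWindowListener(ExitOnCloseListener.closingHandler(...)).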

    The next several lines of the constructor initialize and lay out the interface controls. Similar to other applications in this book, the layout is arranged using the GridBagLayout class and its associated GridBagConstraints class. First, the Search section of the interface is laid out, followed by the Stats section. The Search section includes all the controls for entering the search criteria and constraints. The Stats section holds all the controls for displaying the current crawling status, such as how many URLs have been crawled and how many URLs are left to crawl.

    It’s important to point out three things in the Search and Stats sections. First, the Matches Log File text field control is initialized with a string containing a filename. This string is set to a file called crawler.log in the directory the application is run from, as specified by the Java system property user.dir. Second, an ActionListener is added to the Search button so that the actionSearch( ) method is called each time the button is clicked. Third, the font for each label that is used to display results is updated with a call to setFont( ). The setFont( ) call is used to turn off the bolding of the label fonts so that they are distinguished in the interface.
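
    The first of these points can be sketched as follows. This is a minimal example of building the default log file path from the user.dir system property. The class and method names are hypothetical, and the original code may assemble the string differently.

```java
import java.io.File;

// Hypothetical helper, not part of SearchCrawler.
class DefaultLogPath {
    // Builds the path of crawler.log inside the directory
    // the application was started from.
    static String defaultLogFile() {
        String workingDir = System.getProperty("user.dir");
        return new File(workingDir, "crawler.log").getPath();
    }

    public static void main(String[] args) {
        // Prints a platform-specific path ending in crawler.log.
        System.out.println(defaultLogFile());
    }
}
```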

    Following the Search and Stats sections of the interface is the Matches section, which consists of the matches table listing the URLs that contain the search string. The matches table is instantiated with a new DefaultTableModel subclass passed to its constructor. Typically a full subclass of DefaultTableModel is used to customize the data model used by a JTable; however, in this case only the isCellEditable( ) method needs to be overridden. The isCellEditable( ) method instructs the table that no cells should be editable by returning false, regardless of the row and column specified.
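
    A read-only table model of this kind might look like the sketch below. It assumes a single column named "URL" and uses an anonymous subclass; the class name ReadOnlyTableModelDemo and the sample row are made up for the example.

```java
import javax.swing.table.DefaultTableModel;

// Hypothetical demo class, not part of SearchCrawler.
class ReadOnlyTableModelDemo {
    // Creates an empty one-column model whose cells can never be edited.
    static DefaultTableModel createModel() {
        return new DefaultTableModel(new Object[][] {}, new String[] {"URL"}) {
            @Override
            public boolean isCellEditable(int row, int column) {
                // Same answer for every cell, whatever the row and column.
                return false;
            }
        };
    }

    public static void main(String[] args) {
        DefaultTableModel model = createModel();
        model.addRow(new Object[] {"http://example.com/"});
        System.out.println(model.isCellEditable(0, 0)); // prints "false"
    }
}
```

    The model would then be handed to the table's constructor, as in new JTable(ReadOnlyTableModelDemo.createModel()).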

    Once the matches table is initialized, it is added to the Matches panel. Finally, the Search panel and Matches panel are added to the interface.

    The actionSearch() Method

    The actionSearch( ) method is invoked each time the Search (or Stop) button is clicked. The actionSearch( ) method starts with these lines of code:

    // If stop button clicked, turn crawling flag off.
    if (crawling) {
      crawling = false;
      return;
    }

    Since the Search button in the interface doubles as both the Search button and the Stop button, it’s necessary to know which of the two buttons was clicked. When crawling is underway, the crawling flag is set to true. Thus if the crawling flag is true when the actionSearch( ) method is invoked, the Stop button was clicked. In this scenario, the crawling flag is set to false and actionSearch( ) returns so that the rest of the method is not executed.
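
    The dual-purpose button logic can be tried out in isolation with a small sketch like the one below. It is only an approximation: the class SearchStopToggle, the returned strings, and the decision to set the flag to true inside actionSearch( ) are assumptions made so the example is self-contained. The flag is declared volatile on the assumption that a separate crawling thread reads it.

```java
// Hypothetical stand-in for the Search/Stop button handling.
class SearchStopToggle {
    // Assumed to be read by a crawl thread, hence volatile.
    private volatile boolean crawling = false;

    // Called for every click on the shared Search/Stop button.
    // Returns which role the button played for this click.
    String actionSearch() {
        if (crawling) {
            // A crawl is running, so the click means "Stop".
            crawling = false;
            return "stop";
        }
        // No crawl is running, so the click means "Search".
        // (Assumption: the flag is raised here for illustration only.)
        crawling = true;
        return "search";
    }

    boolean isCrawling() {
        return crawling;
    }
}
```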

    Next, an ArrayList variable, errorList, is initialized:

    ArrayList errorList = new ArrayList();

    The errorList is used to hold any error messages generated by the next several lines of code, which validate that all required search fields have been entered.

    It goes without saying that the Search Crawler will not function without a URL that specifies the location at which to start crawling. The following code verifies that a starting URL has been entered and that the URL is valid:

    // Validate that the start URL has been entered.
    String startUrl = startTextField.getText().trim();
    if (startUrl.length() < 1) {
      errorList.add("Missing Start URL.");
    }
    // Verify start URL.
    else if (verifyUrl(startUrl) == null) {
      errorList.add("Invalid Start URL.");
    }

    If either of these checks fails, an error message is added to the error list. Next, the Max URLs to Crawl combo box value is validated:

    // Validate that Max URLs is either empty or is a number.
    int maxUrls = -1;
    String max = ((String) maxComboBox.getSelectedItem()).trim();
    if (max.length() > 0) {
      try {
        maxUrls = Integer.parseInt(max);
      } catch (NumberFormatException e) {
        // Leave maxUrls at -1 so the check below reports the error.
      }
      if (maxUrls < 1) {
        errorList.add("Invalid Max URLs value.");
      }
    }

    Validating the maximum number of URLs to crawl is a bit more involved than the other validations in this method. This is because the Max URLs to Crawl field can either contain a positive number that indicates the maximum number of URLs to crawl or can be left blank to indicate that no maximum should be used. Initially, maxUrls defaults to -1 to indicate no maximum. If the user enters something into the Max URLs to Crawl field, it is checked for being a valid numeric value with a call to Integer.parseInt( ). Integer.parseInt( ) converts a String representation of an integer into an int value. If the String representation cannot be converted to an integer, a NumberFormatException is thrown and the maxUrls value is not set, so it remains -1. Next, maxUrls is checked to see if it is less than 1. If so, an error is added to the error list.
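
    To make the three possible outcomes concrete, the validation can be pulled into a standalone method, as in the sketch below. The class MaxUrlsValidation and the method parseMaxUrls( ) are hypothetical; the logic inside mirrors the snippet above, with the field value passed in as a plain String.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the Max URLs check.
class MaxUrlsValidation {
    // Blank input returns -1 (no maximum) and records no error.
    // A positive integer is returned as-is. Anything else records
    // an error; non-numeric input leaves the result at -1.
    static int parseMaxUrls(String raw, List<String> errorList) {
        int maxUrls = -1;
        String max = raw.trim();
        if (max.length() > 0) {
            try {
                maxUrls = Integer.parseInt(max);
            } catch (NumberFormatException e) {
                // maxUrls stays -1 and fails the check below.
            }
            if (maxUrls < 1) {
                errorList.add("Invalid Max URLs value.");
            }
        }
        return maxUrls;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<String>();
        System.out.println(parseMaxUrls("", errors));    // prints "-1"
        System.out.println(parseMaxUrls("50", errors));  // prints "50"
        System.out.println(parseMaxUrls("abc", errors)); // prints "-1"
        System.out.println(errors.size());               // prints "1"
    }
}
```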

    Next, the Matches Log File and Search String fields are validated:

    // Validate that the matches log file has been entered.
    String logFile = logTextField.getText().trim();
    if (logFile.length() < 1) {
      errorList.add("Missing Matches Log File.");
    }
    // Validate that the search string has been entered.
    String searchString = searchTextField.getText().trim();
    if (searchString.length() < 1) {
      errorList.add("Missing Search String.");
    }

    If either of these fields has not been entered, an error message is added to the error list.

    The following code checks to see if any errors have been recorded during validation. If so, all the errors are concatenated into a single message and displayed with a call to showError( ).

    // Show errors, if any, and return.
    if (errorList.size() > 0) {
      StringBuffer message = new StringBuffer();
      // Concatenate errors into single message.
      for (int i = 0; i < errorList.size(); i++) {
        message.append(errorList.get(i));
        if (i + 1 < errorList.size()) {
          message.append("\n");
        }
      }
      showError(message.toString());
      return;
    }

    For efficiency’s sake, a StringBuffer object (referred to by message) is used to hold the concatenated message. The error list is iterated over with a for loop, adding each message to message. Notice that each time a message is added, a check is performed to see if the message is the last in the list or not. If the message is not the last message in the list, a newline (\n) character is added so that each message will be displayed on its own line in the error dialog box shown with the showError( ) method.
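
    For readers on Java 8 or later, the same newline-separated message can be produced with String.join( ), which was not available when this code was written. The sketch below places a loop-based version next to it for comparison. It uses StringBuilder in place of StringBuffer, which is a substitution made here, and the class ErrorMessageJoin and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical comparison of two ways to build the error message.
class ErrorMessageJoin {
    // Loop-based version: append a newline after every message
    // except the last one.
    static String joinWithLoop(List<String> errorList) {
        StringBuilder message = new StringBuilder();
        for (int i = 0; i < errorList.size(); i++) {
            message.append(errorList.get(i));
            if (i + 1 < errorList.size()) {
                message.append("\n");
            }
        }
        return message.toString();
    }

    // Java 8+ shortcut that inserts the separator between elements only.
    static String joinWithStringJoin(List<String> errorList) {
        return String.join("\n", errorList);
    }

    public static void main(String[] args) {
        List<String> errors =
            Arrays.asList("Missing Start URL.", "Missing Search String.");
        // Both calls produce the two messages on separate lines.
        System.out.println(joinWithLoop(errors));
        System.out.println(joinWithStringJoin(errors));
    }
}
```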

    Finally, after all the field validations are successful, actionSearch( ) concludes by removing “www” from the starting URL and then calling the search( ) method:

    // Remove "www" from start URL if present.
    startUrl = removeWwwFromUrl(startUrl);
    // Start the Search Crawler.
    search(logFile, startUrl, maxUrls, searchString);


    This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388).
