Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how to build a Web crawler in Java, walking through the SearchCrawler class and its methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
  1. · Crawling the Web with Java
  2. · Fundamentals of a Web Crawler
  3. · An Overview of the Search Crawler
  4. · The SearchCrawler Class part 1
  5. · The SearchCrawler Class part 2
  6. · SearchCrawler Variables and Constructor
  7. · The search() Method
  8. · The showError() and updateStats() Methods
  9. · The addMatch() and verifyURL() Methods
  10. · The downloadPage(), removeWwwFromURL(), and
  11. · An Overview of Regular Expression Processing
  12. · A Close Look at retrieveLinks()
  13. · The searchStringMatches() Method
  14. · The crawl() Method
  15. · Compiling and Running the Search Web Crawler

Crawling the Web with Java - SearchCrawler Variables and Constructor
(Page 6 of 15 )

The SearchCrawler Variables

SearchCrawler starts off by declaring several instance variables, most of which hold references to the interface controls. First, the MAX_URLS String array declares the list of values to be displayed in the Max URLs to Crawl combo box. Second, disallowListCache is defined for caching robot disallow lists so that they don’t have to be retrieved for each URL being crawled. Next, each of the interface controls is declared for the Search, Stats, and Matches sections of the interface. After the interface controls have been declared, the crawling flag is defined for tracking whether or not crawling is underway. Finally, the logFileWriter instance variable, which is used for printing search matches to a log file, is declared.

The SearchCrawler Constructor

When the SearchCrawler is instantiated, all the interface controls are initialized inside its constructor. The constructor contains a lot of code, but most of it is straightforward. The following discussion gives an overview.

First, the application’s window title is set with a call to setTitle( ). Next, the setSize( ) call establishes the window’s width and height in pixels. After that, a window listener is added by calling addWindowListener( ) with a WindowAdapter object that overrides the windowClosing( ) event handler. This handler calls the actionExit( ) method when the application’s window is closed. Next, a menu bar with a File menu is added to the application’s window.

The next several lines of the constructor initialize and lay out the interface controls. As in other applications in this book, the layout is arranged using the GridBagLayout class and its associated GridBagConstraints class. First, the Search section of the interface is laid out, followed by the Stats section. The Search section includes all the controls for entering the search criteria and constraints. The Stats section holds all the controls for displaying the current crawling status, such as how many URLs have been crawled and how many URLs are left to crawl.

It’s important to point out three things in the Search and Stats sections. First, the Matches Log File text field control is initialized with a string containing a filename. This string is set to a file called crawler.log in the directory the application is run from, as specified by the Java system property user.dir. Second, an ActionListener is added to the Search button so that the actionSearch( ) method is called each time the button is clicked. Third, the font for each label that is used to display results is updated with a call to setFont( ). The setFont( ) call turns off the bolding of these label fonts so that they stand out from the other labels in the interface.
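
As a rough illustration of the first point, the default log file path might be assembled as in the sketch below. The class and method names (LogFileDefaultSketch, defaultLogFile) are hypothetical and are not part of SearchCrawler; only the user.dir and file.separator system properties and the crawler.log filename are taken as given.

```java
// Hypothetical sketch: builds the default path shown in the Matches Log File field.
class LogFileDefaultSketch {
    // Returns "<directory the application was started from>/crawler.log".
    static String defaultLogFile() {
        String dir = System.getProperty("user.dir");
        String separator = System.getProperty("file.separator");
        return dir + separator + "crawler.log";
    }

    public static void main(String[] args) {
        // Prints a path that depends on where the program is launched.
        System.out.println(defaultLogFile());
    }
}
```

Reading the directory from a system property rather than hard-coding it keeps the default valid wherever the application is launched.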

Following the Search and Stats sections of the interface is the Matches section, which consists of the matches table listing the URLs that contain the search string. The matches table is instantiated with a new DefaultTableModel subclass passed to its constructor. Typically a full subclass of DefaultTableModel is written to customize the data model used by a JTable; however, in this case only the isCellEditable( ) method needs to be overridden. The isCellEditable( ) method tells the table that no cells should be editable by returning false, regardless of the row and column specified.
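
A read-only table model of this kind can be sketched as follows. This is a minimal illustration, not the book's listing: the class name ReadOnlyModelSketch, the use of an anonymous subclass, and the single "URL" column are all assumptions.

```java
import javax.swing.table.DefaultTableModel;

// Hypothetical sketch of a table model whose cells can never be edited.
class ReadOnlyModelSketch {
    static DefaultTableModel createModel() {
        // No initial rows and one column; the column name is an assumption.
        return new DefaultTableModel(new Object[][] {}, new String[] {"URL"}) {
            @Override
            public boolean isCellEditable(int row, int column) {
                return false; // same answer for every row and column
            }
        };
    }

    public static void main(String[] args) {
        DefaultTableModel model = createModel();
        model.addRow(new Object[] {"http://example.com/"});
        System.out.println(model.isCellEditable(0, 0));
    }
}
```

A model built this way could then be handed to a JTable constructor, so that rows can still be added programmatically while the user cannot change them.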

Once the matches table is initialized, it is added to the Matches panel. Finally, the Search panel and Matches panel are added to the interface.

The actionSearch() Method

The actionSearch( ) method is invoked each time the Search (or Stop) button is clicked. The actionSearch( ) method starts with these lines of code:

// If stop button clicked, turn crawling flag off.
if (crawling) {
  crawling = false;
  return;
}

Since the Search button in the interface doubles as the Stop button, it’s necessary to know which of the two roles the button was playing when it was clicked. When crawling is underway, the crawling flag is set to true. Thus if the crawling flag is true when the actionSearch( ) method is invoked, the Stop button was clicked. In this scenario, the crawling flag is set to false and actionSearch( ) returns so that the rest of the method is not executed.
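
The dual-purpose behavior can be sketched in isolation, without any Swing controls. The class name SearchStopToggleSketch and the string return values are hypothetical and exist only to make the two paths visible; the flag handling follows the description above.

```java
// Hypothetical sketch of a single handler that serves as both Search and Stop.
class SearchStopToggleSketch {
    // True while a crawl is underway, as described in the text.
    private boolean crawling = false;

    // Returns "stop" if the click cancelled a crawl, otherwise "search".
    String actionSearch() {
        if (crawling) {
            crawling = false; // the click was on "Stop"
            return "stop";
        }
        // Assumption: the real application validates its fields here and the
        // flag is raised once crawling begins; the sketch sets it directly.
        crawling = true;
        return "search";
    }

    public static void main(String[] args) {
        SearchStopToggleSketch ui = new SearchStopToggleSketch();
        System.out.println(ui.actionSearch()); // first click starts a search
        System.out.println(ui.actionSearch()); // second click stops it
    }
}
```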

Next, an ArrayList variable, errorList, is initialized:

ArrayList errorList = new ArrayList();

The errorList is used to hold any error messages generated by the next several lines of code, which validate that all required search fields have been entered.

The Search Crawler cannot function without a URL that specifies the location at which to start crawling. The following code verifies that a starting URL has been entered and that the URL is valid:

// Validate that the start URL has been entered.
String startUrl = startTextField.getText().trim();
if (startUrl.length() < 1) {
  errorList.add("Missing Start URL.");
}
// Verify start URL.
else if (verifyUrl(startUrl) == null) {
  errorList.add("Invalid Start URL.");
}

If either of these checks fails, an error message is added to the error list. Next, the Max URLs to Crawl combo box value is validated:

// Validate that Max URLs is either empty or is a number.
int maxUrls = -1;
String max = ((String) maxComboBox.getSelectedItem()).trim();
if (max.length() > 0) {
  try {
    maxUrls = Integer.parseInt(max);
  } catch (NumberFormatException e) {
  }
  if (maxUrls < 1) {
    errorList.add("Invalid Max URLs value.");
  }
}

Validating the maximum number of URLs to crawl is a bit more involved than the other validations in this method, because the Max URLs to Crawl field can either contain a positive number that sets the maximum number of URLs to crawl or be left blank to indicate that no maximum should be used. Initially, maxUrls defaults to -1 to indicate no maximum. If the user enters something into the Max URLs to Crawl field, it is checked for being a valid number with a call to Integer.parseInt( ), which converts a String representation of an integer into an int value. If the String cannot be converted to an integer, a NumberFormatException is thrown and maxUrls keeps its default value of -1. Next, maxUrls is checked to see if it is less than 1. If so, an error is added to the error list, which covers both non-numeric input and numbers that are zero or negative.
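
To see how the different inputs behave, the same logic can be pulled into a standalone method, as in the sketch below. The class MaxUrlsSketch and the method parseMaxUrls are hypothetical names introduced for illustration; the combo box is replaced by a plain String argument so the code runs without a user interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the Max URLs validation described above.
class MaxUrlsSketch {
    // Returns -1 for a blank field (no maximum) or the parsed value.
    // Adds a message to errors when the text is not a number of at least 1.
    static int parseMaxUrls(String text, List<String> errors) {
        int maxUrls = -1;
        String max = text.trim();
        if (max.length() > 0) {
            try {
                maxUrls = Integer.parseInt(max);
            } catch (NumberFormatException e) {
                // maxUrls stays at -1, so the range check below reports it.
            }
            if (maxUrls < 1) {
                errors.add("Invalid Max URLs value.");
            }
        }
        return maxUrls;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<String>();
        System.out.println(parseMaxUrls("", errors));    // blank field, no error
        System.out.println(parseMaxUrls("50", errors));  // valid number, no error
        System.out.println(parseMaxUrls("abc", errors)); // not a number, one error
        System.out.println(parseMaxUrls("0", errors));   // below 1, one more error
        System.out.println(errors.size());
    }
}
```

Note that the range check sits inside the non-empty branch, which is what lets a blank field keep -1 without being reported as invalid.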

Next, the Matches Log File and Search String fields are validated:

// Validate that the matches log file has been entered.
String logFile = logTextField.getText().trim();
if (logFile.length() < 1) {
  errorList.add("Missing Matches Log File.");
}
// Validate that the search string has been entered.
String searchString = searchTextField.getText().trim();
if (searchString.length() < 1) {
  errorList.add("Missing Search String.");
}

If either of these fields has not been entered, an error message is added to the error list.

The following code checks to see if any errors have been recorded during validation. If so, all the errors are concatenated into a single message and displayed with a call to showError( ).

// Show errors, if any, and return.
if (errorList.size() > 0) {
  StringBuffer message = new StringBuffer();
  // Concatenate errors into single message.
  for (int i = 0; i < errorList.size(); i++) {
    message.append(errorList.get(i));
    if (i + 1 < errorList.size()) {
      message.append("\n");
    }
  }
  showError(message.toString());
  return;
}

For efficiency’s sake, a StringBuffer object (referred to by message) is used to hold the concatenated message. The error list is iterated over with a for loop, adding each message to message. Notice that each time a message is added, a check is performed to see if the message is the last in the list or not. If the message is not the last message in the list, a newline (\n) character is added so that each message will be displayed on its own line in the error dialog box shown with the showError( ) method.
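
The joining step can be tried on its own with the sketch below, which returns the combined message instead of showing a dialog box. ErrorMessageSketch and joinErrors are hypothetical names, and the sample messages are taken from the validations above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of building one multi-line message from an error list.
class ErrorMessageSketch {
    static String joinErrors(List<String> errorList) {
        StringBuffer message = new StringBuffer();
        for (int i = 0; i < errorList.size(); i++) {
            message.append(errorList.get(i));
            // Add a newline after every message except the last one.
            if (i + 1 < errorList.size()) {
                message.append("\n");
            }
        }
        return message.toString();
    }

    public static void main(String[] args) {
        List<String> errors =
            Arrays.asList("Missing Start URL.", "Missing Search String.");
        // Each message appears on its own line, with no trailing newline.
        System.out.println(joinErrors(errors));
    }
}
```

On Java 8 and later, String.join("\n", errorList) produces the same result for a list of strings; the explicit loop matches the style of the code from the book's era.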

Finally, after all the field validations are successful, actionSearch( ) concludes by removing “www” from the starting URL and then calling the search( ) method:

// Remove "www" from start URL if present.
startUrl = removeWwwFromUrl(startUrl);
// Start the Search Crawler.
search(logFile, startUrl, maxUrls, searchString);
