Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how to build a Web crawler in Java, walking through the SearchCrawler class and its methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
  1. Crawling the Web with Java
  2. Fundamentals of a Web Crawler
  3. An Overview of the Search Crawler
  4. The SearchCrawler Class part 1
  5. The SearchCrawler Class part 2
  6. SearchCrawler Variables and Constructor
  7. The search() Method
  8. The showError() and updateStats() Methods
  9. The addMatch() and verifyURL() Methods
  10. The downloadPage(), removeWwwFromURL(), and
  11. An Overview of Regular Expression Processing
  12. A Close Look at retrieveLinks()
  13. The searchStringMatches() Method
  14. The crawl() Method
  15. Compiling and Running the Search Web Crawler

Crawling the Web with Java - The search() Method
(Page 7 of 15 )

The search( ) method is used to begin the Web crawling process. Since this process can take a considerable amount of time to complete, a new thread is created so that the search code can run independently. This frees up Swing’s event thread, allowing changes in the interface to take place while crawling is underway.

The search( ) method starts with these lines of code:

// Start the search in a new thread.
Thread thread = new Thread(new Runnable() {
  public void run() {

To run the search code in a separate thread, a new Thread object is instantiated with a Runnable instance passed to its constructor. Instead of creating a separate class that implements the Runnable interface, the code is in-lined.
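
The in-lined Runnable pattern can be tried in isolation. The following is a minimal sketch, not the book's code: the class name BackgroundTaskDemo, the method runInBackground( ), and the thread name "search-worker" are all made up for illustration. It records which thread executed run( ) so the effect of handing a Runnable to a Thread is visible.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of SearchCrawler.
class BackgroundTaskDemo {
  // Runs a task on a new thread and returns the name of the thread that ran it.
  static String runInBackground() {
    final AtomicReference<String> workerName = new AtomicReference<String>();

    // The Runnable is defined in-line as an anonymous inner class.
    Thread thread = new Thread(new Runnable() {
      public void run() {
        workerName.set(Thread.currentThread().getName());
      }
    }, "search-worker");

    thread.start();  // run() executes on the new thread, not the caller's
    try {
      thread.join(); // wait only so the result can be inspected
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return workerName.get();
  }

  public static void main(String[] args) {
    System.out.println("Task ran on: " + runInBackground());
  }
}
```

The call to join( ) exists only so the demo can report its result. A GUI application such as the search crawler would not wait on the worker like this, since blocking the event thread is exactly what the separate thread is meant to avoid.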

Before the search starts, the interface controls are updated to indicate that crawling is underway, as shown here:

// Show hour glass cursor while crawling is under way.
setCursor(Cursor.getPredefinedCursor(Cursor.WAIT_CURSOR));
// Disable search controls.
// Switch search button to "Stop."
searchButton.setText("Stop");

First, the application’s cursor is set to the WAIT_CURSOR to signify that the application is busy. On most operating systems, the WAIT_CURSOR is an hourglass. After the cursor has been set, each of the search interface controls is disabled by calling the setEnabled( ) method with a false flag on the control. Next, the Search button is changed to read “Stop.” The label changes because, while a search is underway, the button doubles as the control for stopping the current search.

After disabling the search controls, the Stats section of the interface is reset, as shown here:

// Reset stats.
table.setModel(new DefaultTableModel(new Object[][]{},
  new String[]{"URL"}) {
  public boolean isCellEditable(int row, int column) {
    return false;
  }
});
updateStats(startUrl, 0, 0, maxUrls);

First, the matches table’s data model is reset by passing the setModel( ) method a new, empty DefaultTableModel instance whose isCellEditable( ) override returns false, which keeps the table read-only. Second, the updateStats( ) method is called to refresh the progress bar and the status labels.
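
The read-only table model can be exercised without any GUI. This is a small sketch under stated assumptions: the class ReadOnlyTableModelDemo and the method newEmptyMatchesModel( ) are illustrative names that do not appear in the book, and the example URL is a placeholder.

```java
import javax.swing.table.DefaultTableModel;

// Hypothetical demo class; not part of SearchCrawler.
class ReadOnlyTableModelDemo {
  // Builds an empty, single-column model whose cells cannot be edited.
  static DefaultTableModel newEmptyMatchesModel() {
    return new DefaultTableModel(new Object[][]{}, new String[]{"URL"}) {
      @Override
      public boolean isCellEditable(int row, int column) {
        return false;
      }
    };
  }

  public static void main(String[] args) {
    DefaultTableModel model = newEmptyMatchesModel();
    model.addRow(new Object[]{"http://example.com/"});  // placeholder match
    System.out.println(model.getRowCount() + " row(s), editable: "
        + model.isCellEditable(0, 0));
  }
}
```

Replacing the whole model, rather than removing rows one at a time, clears any matches left over from a previous search in a single step.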

Next, the log file is opened and the crawling flag is turned on:

// Open matches log file.
try {
  logFileWriter = new PrintWriter(new FileWriter(logFile));
} catch (Exception e) {
  showError("Unable to open matches log file.");
}
// Turn crawling flag on.
crawling = true;

The log file is opened by creating a new PrintWriter instance that writes to the file. If the file cannot be opened, an error dialog box is displayed by calling showError( ). The crawling flag is set to true to indicate to the actionSearch( ) method that crawling is underway.
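
The open, write, and close steps for a log file can be sketched on their own. The code below is an illustration rather than the book's implementation: MatchLogDemo, openLog( ), logMatch( ), and the file name matches-demo.log are assumptions, and errors go to System.err because showError( ) is not available outside the crawler.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical demo class; not part of SearchCrawler.
class MatchLogDemo {
  // Opens the log for writing; returns null (after reporting) if that fails.
  static PrintWriter openLog(String path) {
    try {
      return new PrintWriter(new FileWriter(path));
    } catch (IOException e) {
      System.err.println("Unable to open matches log file: " + e.getMessage());
      return null;
    }
  }

  // Writes one line and closes the log; returns true if no write error occurred.
  static boolean logMatch(PrintWriter log, String url) {
    log.println(url);
    boolean ok = !log.checkError();  // PrintWriter reports failures here instead of throwing
    log.close();
    return ok;
  }

  public static void main(String[] args) {
    String path = new File(System.getProperty("java.io.tmpdir"),
        "matches-demo.log").getPath();
    PrintWriter log = openLog(path);
    if (log != null) {
      System.out.println("Logged: " + logMatch(log, "http://example.com/"));
    }
  }
}
```

Because PrintWriter's print methods do not throw IOException, checkError( ) is the way to find out whether a write actually succeeded.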

The following code kicks off the actual search crawling by invoking the crawl( ) method:

// Perform the actual crawling.
crawl(startUrl, maxUrls, limitCheckBox.isSelected(), 
  searchString, caseCheckBox.isSelected());

After crawling has completed, the crawling flag is turned off and the matches log file is closed, as shown here:

// Turn crawling flag off.
crawling = false;
// Close matches log file.
try {
  logFileWriter.close();
} catch (Exception e) {
  showError("Unable to close matches log file.");
}

The crawling flag is set to false to indicate that crawling is no longer underway. Next, the matches log file is closed since crawling is finished. Similar to opening the file, if an exception is thrown while trying to close the file, an error dialog box will be shown with a call to showError( ).
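
A flag like crawling is a common way to let a button both report and stop background work. The sketch below shows one way such a flag might be shared between threads. It is an assumption-laden illustration, not the book's code: the class StoppableCrawlDemo and its methods are invented, the flag is declared volatile here (the excerpt does not show how the original field is declared), and a short sleep stands in for crawling a page.

```java
// Hypothetical demo class; not part of SearchCrawler.
class StoppableCrawlDemo {
  // volatile so that a change made by one thread is visible to the other.
  private volatile boolean crawling;
  private Thread worker;

  // Turns the flag on and starts the background loop.
  void start() {
    crawling = true;
    worker = new Thread(new Runnable() {
      public void run() {
        try {
          while (crawling) {
            pause(10);  // stand-in for downloading and scanning one page
          }
        } finally {
          crawling = false;  // always mark the crawl as finished
        }
      }
    });
    worker.start();
  }

  // Asks the loop to stop and waits for the worker thread to exit.
  void stopAndWait() {
    crawling = false;
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  boolean isCrawling() {
    return crawling;
  }

  private static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    StoppableCrawlDemo demo = new StoppableCrawlDemo();
    demo.start();
    System.out.println("crawling: " + demo.isCrawling());  // prints "crawling: true"
    demo.stopAndWait();
    System.out.println("crawling: " + demo.isCrawling());  // prints "crawling: false"
  }
}
```

Resetting the flag in a finally block means it is cleared even if the loop exits through an unexpected exception.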

Because crawling is finished, the search controls are reactivated by the following code:

// Mark search as done.
// Enable search controls.
// Switch search button back to "Search."
searchButton.setText("Search");
// Return to default cursor.
setCursor(Cursor.getDefaultCursor());

First, the Crawling field is updated to display “Done.” Second, each of the search controls is re-enabled. Third, the Stop button reverts to displaying “Search.” Finally, the cursor returns to the default application cursor.

If the search did not yield any matches, the following code displays a dialog box to indicate this fact:

// Show message if search string not found.
if (table.getRowCount() == 0) {
  JOptionPane.showMessageDialog(SearchCrawler.this,
    "Your Search String was not found. Please try another.",
    "Search String Not Found",
    JOptionPane.ERROR_MESSAGE);
}

The search( ) method wraps up with the following lines of code:

  }
});
thread.start();


After the Runnable implementation’s run( ) method has been defined, the search thread is started with a call to thread.start( ). Upon the thread’s execution, the Runnable instance’s run( ) method will be invoked.
