
Crawling the Web with Java


This article walks through a Web crawler written in Java, the SearchCrawler class, and examines its methods in detail. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
TABLE OF CONTENTS:
  1. Crawling the Web with Java
  2. Fundamentals of a Web Crawler
  3. An Overview of the Search Crawler
  4. The SearchCrawler Class part 1
  5. The SearchCrawler Class part 2
  6. SearchCrawler Variables and Constructor
  7. The search() Method
  8. The showError() and updateStats() Methods
  9. The addMatch() and verifyURL() Methods
  10. The downloadPage(), removeWwwFromURL(), and
  11. An Overview of Regular Expression Processing
  12. A Close Look at retrieveLinks()
  13. The searchStringMatches() Method
  14. The crawl() Method
  15. Compiling and Running the Search Web Crawler


Crawling the Web with Java - The SearchCrawler Class part 1
(Page 4 of 15 )

SearchCrawler has a main( ) method, which is the first code to run when the program is executed. The main( ) method instantiates a new SearchCrawler object and then calls that object's show( ) method, which displays the window.
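
The launch sequence described above can be sketched in isolation. This is a minimal illustration, not the book's main( ) method: CrawlerWindowSketch is a hypothetical stand-in for SearchCrawler, and it makes three assumptions of its own. It calls setVisible(true) because show( ) is deprecated on current JDKs, it builds the window on the event dispatch thread with SwingUtilities.invokeLater, and it uses EXIT_ON_CLOSE in place of the window listener that SearchCrawler installs.

```java
import java.awt.GraphicsEnvironment;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

// Hypothetical stand-in for SearchCrawler; only the launch pattern matters here.
final class CrawlerWindowSketch extends JFrame {
  static final String TITLE = "Search Crawler";

  CrawlerWindowSketch() {
    setTitle(TITLE);
    setSize(600, 600);
    // Assumption: a simple replacement for the WindowAdapter/actionExit() pair.
    setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
  }

  public static void main(String[] args) {
    // A frame cannot be created without a display, so bail out early.
    if (GraphicsEnvironment.isHeadless()) {
      System.out.println("No display available; skipping window creation.");
      return;
    }
    // Create and show the frame on the event dispatch thread.
    SwingUtilities.invokeLater(() -> {
      CrawlerWindowSketch crawler = new CrawlerWindowSketch();
      crawler.setVisible(true); // modern equivalent of the deprecated show()
    });
  }
}
```

Calling show( ) directly from main( ), as the text describes, still compiles; the sketch only shows the form that current Swing documentation recommends.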

The SearchCrawler class is shown here and is examined in detail in the following sections. Notice that it extends JFrame:

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
import javax.swing.*;
import javax.swing.table.*;
// The Search Web Crawler
public class SearchCrawler extends JFrame
{
 
  // Max URLs drop-down values.
  private static final String[] MAX_URLS =
    {"50", "100", "500", "1000"};

  // Cache of robot disallow lists.
  private HashMap disallowListCache = new HashMap();

  // Search GUI controls.
  private JTextField startTextField;
  private JComboBox maxComboBox;
  private JCheckBox limitCheckBox;
  private JTextField logTextField;
  private JTextField searchTextField;
  private JCheckBox caseCheckBox;
  private JButton searchButton;

  // Search stats GUI controls.
  private JLabel crawlingLabel2;
  private JLabel crawledLabel2;
  private JLabel toCrawlLabel2;
  private JProgressBar progressBar;
  private JLabel matchesLabel2;

  // Table listing search matches.
  private JTable table;

  // Flag for whether or not crawling is underway.
  private boolean crawling;

  // Matches log file print writer.
  private PrintWriter logFileWriter;
 
  // Constructor for Search Web Crawler.
  public SearchCrawler()
  {
    // Set application title.
    setTitle("Search Crawler");

    // Set window size.
    setSize(600, 600);

    // Handle window closing events.
    addWindowListener(new WindowAdapter() {
      public void windowClosing(WindowEvent e) {
        actionExit();
      }
    });

    // Set up File menu.
    JMenuBar menuBar = new JMenuBar();
    JMenu fileMenu = new JMenu("File");
    fileMenu.setMnemonic(KeyEvent.VK_F);
    JMenuItem fileExitMenuItem = new JMenuItem("Exit",
      KeyEvent.VK_X);
    fileExitMenuItem.addActionListener(new ActionListener() {
      public void actionPerformed(ActionEvent e) {
        actionExit();
      }
    });
    fileMenu.add(fileExitMenuItem);
    menuBar.add(fileMenu);
    setJMenuBar(menuBar);
   
    // Set up search panel.
    JPanel searchPanel = new JPanel();
    GridBagConstraints constraints;
    GridBagLayout layout = new GridBagLayout();
    searchPanel.setLayout(layout);

    JLabel startLabel = new JLabel("Start URL:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(startLabel, constraints);
    searchPanel.add(startLabel);

    startTextField = new JTextField();
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(startTextField, constraints);
    searchPanel.add(startTextField);

    JLabel maxLabel = new JLabel("Max URLs to Crawl:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(maxLabel, constraints);
    searchPanel.add(maxLabel);

    maxComboBox = new JComboBox(MAX_URLS);
    maxComboBox.setEditable(true);
    constraints = new GridBagConstraints();
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(maxComboBox, constraints);
    searchPanel.add(maxComboBox);

    limitCheckBox =
      new JCheckBox("Limit crawling to Start URL site");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.WEST;
    constraints.insets = new Insets(0, 10, 0, 0);
    layout.setConstraints(limitCheckBox, constraints);
    searchPanel.add(limitCheckBox);

    JLabel blankLabel = new JLabel();
    constraints = new GridBagConstraints();
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    layout.setConstraints(blankLabel, constraints);
    searchPanel.add(blankLabel);

    JLabel logLabel = new JLabel("Matches Log File:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(logLabel, constraints);
    searchPanel.add(logLabel);

    String file =
      System.getProperty("user.dir") +
      System.getProperty("file.separator") +
      "crawler.log";
    logTextField = new JTextField(file);
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(logTextField, constraints);
    searchPanel.add(logTextField);

    JLabel searchLabel = new JLabel("Search String:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(searchLabel, constraints);
    searchPanel.add(searchLabel);

    searchTextField = new JTextField();
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.insets = new Insets(5, 5, 0, 0);
    constraints.gridwidth = 2;
    constraints.weightx = 1.0d;
    layout.setConstraints(searchTextField, constraints);
    searchPanel.add(searchTextField);

    caseCheckBox = new JCheckBox("Case Sensitive");
    constraints = new GridBagConstraints();
    constraints.insets = new Insets(5, 5, 0, 5);
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    layout.setConstraints(caseCheckBox, constraints);
    searchPanel.add(caseCheckBox);

    searchButton = new JButton("Search");
    searchButton.addActionListener(new ActionListener() {
      public void actionPerformed(ActionEvent e) {
        actionSearch();
      }
    });
    constraints = new GridBagConstraints();
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 5, 5);
    layout.setConstraints(searchButton, constraints);
    searchPanel.add(searchButton);

    JSeparator separator = new JSeparator();
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 5, 5);
    layout.setConstraints(separator, constraints);
    searchPanel.add(separator);
   
    JLabel crawlingLabel1 = new JLabel("Crawling:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(crawlingLabel1, constraints);
    searchPanel.add(crawlingLabel1);

    crawlingLabel2 = new JLabel();
    crawlingLabel2.setFont(
      crawlingLabel2.getFont().deriveFont(Font.PLAIN));
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(crawlingLabel2, constraints);
    searchPanel.add(crawlingLabel2);

    JLabel crawledLabel1 = new JLabel("Crawled URLs:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(crawledLabel1, constraints);
    searchPanel.add(crawledLabel1);

    crawledLabel2 = new JLabel();
    crawledLabel2.setFont(
      crawledLabel2.getFont().deriveFont(Font.PLAIN));
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(crawledLabel2, constraints);
    searchPanel.add(crawledLabel2);

    JLabel toCrawlLabel1 = new JLabel("URLs to Crawl:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(toCrawlLabel1, constraints);
    searchPanel.add(toCrawlLabel1);

    toCrawlLabel2 = new JLabel();
    toCrawlLabel2.setFont(
      toCrawlLabel2.getFont().deriveFont(Font.PLAIN));
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(toCrawlLabel2, constraints);
    searchPanel.add(toCrawlLabel2);

    JLabel progressLabel = new JLabel("Crawling Progress:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 0, 0);
    layout.setConstraints(progressLabel, constraints);
    searchPanel.add(progressLabel);

    progressBar = new JProgressBar();
    progressBar.setMinimum(0);
    progressBar.setStringPainted(true);
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 0, 5);
    layout.setConstraints(progressBar, constraints);
    searchPanel.add(progressBar);

    JLabel matchesLabel1 = new JLabel("Search Matches:");
    constraints = new GridBagConstraints();
    constraints.anchor = GridBagConstraints.EAST;
    constraints.insets = new Insets(5, 5, 10, 0);
    layout.setConstraints(matchesLabel1, constraints);
    searchPanel.add(matchesLabel1);

    matchesLabel2 = new JLabel();
    matchesLabel2.setFont(
      matchesLabel2.getFont().deriveFont(Font.PLAIN));
    constraints = new GridBagConstraints();
    constraints.fill = GridBagConstraints.HORIZONTAL;
    constraints.gridwidth = GridBagConstraints.REMAINDER;
    constraints.insets = new Insets(5, 5, 10, 5);
    layout.setConstraints(matchesLabel2, constraints);
    searchPanel.add(matchesLabel2);
   
    // Set up matches table.
    table =
      new JTable(new DefaultTableModel(new Object[][]{},
        new String[]{"URL"}) {
      public boolean isCellEditable(int row, int column)
      {
        return false;
      }
    });

    // Set up Matches panel.
    JPanel matchesPanel = new JPanel();
    matchesPanel.setBorder(
      BorderFactory.createTitledBorder("Matches"));
    matchesPanel.setLayout(new BorderLayout());
    matchesPanel.add(new JScrollPane(table),
      BorderLayout.CENTER);

    // Add panels to display.
    getContentPane().setLayout(new BorderLayout());
    getContentPane().add(searchPanel, BorderLayout.NORTH);
    getContentPane().add(matchesPanel, BorderLayout.CENTER);
  }

  // Exit this program.
  private void actionExit() {
    System.exit(0);
  }
 
  // Handle Search/Stop button being clicked.
  private void actionSearch() {
    // If stop button clicked, turn crawling flag off.
    if (crawling) {
      crawling = false;
      return;
    }

    ArrayList errorList = new ArrayList();

    // Validate that start URL has been entered.
    String startUrl = startTextField.getText().trim();
    if (startUrl.length() < 1) {
      errorList.add("Missing Start URL.");
    }
    // Verify start URL.
    else if (verifyUrl(startUrl) == null) {
      errorList.add("Invalid Start URL.");
    }

    // Validate that Max URLs is either empty or is a number.
    int maxUrls = 0;
    String max = ((String) maxComboBox.getSelectedItem()).trim();
    if (max.length() > 0) {
      try {
        maxUrls = Integer.parseInt(max);
      } catch (NumberFormatException e) {
      }
      if (maxUrls < 1) {
        errorList.add("Invalid Max URLs value.");
      }
    }

    // Validate that matches log file has been entered.
    String logFile = logTextField.getText().trim();
    if (logFile.length() < 1) {
      errorList.add("Missing Matches Log File.");
    }

    // Validate that search string has been entered.
    String searchString = searchTextField.getText().trim();
    if (searchString.length() < 1) {
      errorList.add("Missing Search String.");
    }

    // Show errors, if any, and return.
    if (errorList.size() > 0) {
      StringBuffer message = new StringBuffer();

      // Concatenate errors into single message.
      for (int i = 0; i < errorList.size(); i++) {
        message.append(errorList.get(i));
        if (i + 1 < errorList.size()) {
          message.append("\n");
        }
      }

      showError(message.toString());
      return;
    }

    // Remove "www" from start URL if present.
    startUrl = removeWwwFromUrl(startUrl);

    // Start the Search Crawler.
    search(logFile, startUrl, maxUrls, searchString);
  }
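
The validation in actionSearch( ) collects every problem before reporting, so the user sees all errors in one message. The sketch below isolates that pattern so it can be run without the GUI. It is an illustration under stated assumptions, not the book's code: SearchInputCheck and looksLikeHttpUrl( ) are hypothetical names, the URL check is a rough stand-in for verifyUrl( ) (which is not shown on this page), and generics replace the raw ArrayList of the 2004 listing.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the checks performed in actionSearch().
final class SearchInputCheck {

  // Assumption: a rough substitute for verifyUrl(); accepts only http URLs with a host.
  static boolean looksLikeHttpUrl(String text) {
    try {
      URI uri = new URI(text);
      return "http".equalsIgnoreCase(uri.getScheme()) && uri.getHost() != null;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  // Returns one message per invalid field; an empty list means the input is usable.
  static List<String> validate(String startUrl, String max,
                               String logFile, String searchString) {
    List<String> errors = new ArrayList<>();

    String url = startUrl.trim();
    if (url.isEmpty()) {
      errors.add("Missing Start URL.");
    } else if (!looksLikeHttpUrl(url)) {
      errors.add("Invalid Start URL.");
    }

    // Max URLs may be left empty; otherwise it must parse to a positive number.
    String maxText = max.trim();
    if (!maxText.isEmpty()) {
      int maxUrls = 0;
      try {
        maxUrls = Integer.parseInt(maxText);
      } catch (NumberFormatException e) {
        // maxUrls stays 0, so the check below reports the error.
      }
      if (maxUrls < 1) {
        errors.add("Invalid Max URLs value.");
      }
    }

    if (logFile.trim().isEmpty()) {
      errors.add("Missing Matches Log File.");
    }
    if (searchString.trim().isEmpty()) {
      errors.add("Missing Search String.");
    }
    return errors;
  }

  public static void main(String[] args) {
    // Every field is wrong here, so four messages are printed, one per line.
    System.out.println(String.join("\n", validate("", "abc", "", "")));
  }
}
```

String.join( ) does the same work as the StringBuffer loop in the listing: it places a newline between messages but not after the last one.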

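The matches table in the constructor relies on an anonymous DefaultTableModel subclass whose isCellEditable( ) always returns false, which keeps users from editing the listed URLs. The same idea can be written as a small named class and exercised without creating any window. ReadOnlyMatchesModel is a hypothetical name used only for this sketch, and the example URL is a placeholder.

```java
import javax.swing.table.DefaultTableModel;

// Hypothetical named version of the anonymous model used for the matches table.
final class ReadOnlyMatchesModel extends DefaultTableModel {

  ReadOnlyMatchesModel() {
    // One column named "URL" and no rows to start with.
    super(new Object[] {"URL"}, 0);
  }

  // Refuse edits for every cell, as the listing's anonymous subclass does.
  @Override
  public boolean isCellEditable(int row, int column) {
    return false;
  }

  public static void main(String[] args) {
    ReadOnlyMatchesModel model = new ReadOnlyMatchesModel();
    model.addRow(new Object[] {"http://example.com/"});
    // prints "1 row, editable: false"
    System.out.println(model.getRowCount() + " row, editable: "
        + model.isCellEditable(0, 0));
  }
}
```

A model like this can be passed straight to the JTable constructor in place of the anonymous class.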
