Java
  Home arrow Java arrow Page 3 - Crawling the Web with Java
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Crawling the Web with Java
By: McGraw-Hill/Osborne
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 43
    2005-06-09

    Table of Contents:
  • Crawling the Web with Java
  • Fundamentals of a Web Crawler
  • An Overview of the Search Crawler
  • The SearchCrawler Class part 1
  • The SearchCrawler Class part 2
  • SearchCrawler Variables and Constructor
  • The search() Method
  • The showError() and updateStats() Methods
  • The addMatch() and verifyURL() Methods
  • The downloadPage(), removeWwwFromURL(), and
  • An Overview of Regular Expression Processing
  • A Close Look at retrieveLinks()
  • The searchStringMatches() Method
  • The crawl() Method
  • Compiling and Running the Search Web Crawler

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Crawling the Web with Java - An Overview of the Search Crawler


    (Page 3 of 15 )

    Search Crawler is a basic Web crawler for searching the Web, and it illustrates the fundamental structure of crawler-based applications. With Search Crawler, you can enter search criteria and then search the Web in real time, URL by URL, looking for matches to the criteria.

    Search Crawler’s interface, as shown in Figure 6-1, has three prominent sections, which we will refer to as Search, Stats, and Matches. The Search section at the top of the window has controls for entering search criteria, including the start URL for the search, the maximum number of URLs to crawl, and the search string. The search criteria can be additionally tweaked by choosing to limit the search to the site of the beginning URL and by selecting the Case Sensitive check box for the search string.

    The Stats section, located in the middle of the window, has controls showing the current status of crawling when searching is underway. This section also has a progress bar to indicate the progress toward completing the search.

    The Matches section at the bottom of the window has a table listing all the matches found by a search. These are the URLs of the Web pages that contain the search string.

     
    Figure 6-1.  The Search Crawler GUI interface

    More Java Articles
    More By McGraw-Hill/Osborne


       · I'm tring the web crawler for big samples (I mean to limit it to 10000 pages) but...
       · The HashSet isn't the only possible cause of out of memory errors. This program is...
       · The biggest problem with this program is that it doesn't transform relative links to...
       · It does transform relative links of the form "foo/bar.html", that is, links which...
       · ok
       · Hi,I d like to report a problem that i have and ask you if you all do have this...
       · Hi,I can't find the source code of this project.Could somebody send it to me to...
     

    Buy this book now. This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388). Check it out at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...






    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway
    Stay green...Green IT