Java
  Home arrow Java arrow Page 2 - Crawling the Web with Java
Dev Articles Forums 
ADO.NET  
Apache  
ASP  
ASP.NET  
C#  
C++  
ColdFusion  
COM/COM+  
Delphi-Kylix  
Design Usability  
Development Cycles  
DHTML  
Embedded Tools  
Flash  
Graphic Design  
HTML  
IIS  
Interviews  
Java  
JavaScript  
MySQL  
Oracle  
Photoshop  
PHP  
Reviews  
Ruby-on-Rails  
SQL  
SQL Server  
Style Sheets  
VB.Net  
Visual Basic  
Web Authoring  
Web Services  
Web Standards  
XML  
Mobile Linux 
App Generation ROI 
IBM® developerWorks 
Sun Developer Network 
Weekly Newsletter
 
Developer Updates  
Free Website Content 
 RSS  Articles
 RSS  Forums
 RSS  All Feeds
Write For Us Get Paid 
Request Media Kit
Contact Us 
Site Map 
Privacy Policy 
Support 
 USERNAME
 
 PASSWORD
 
 
  >>> SIGN UP!  
  Lost Password? 
JAVA

Crawling the Web with Java
By: McGraw-Hill/Osborne
  • Search For More Articles!
  • Disclaimer
  • Author Terms
  • Rating: 4 stars4 stars4 stars4 stars4 stars / 43
    2005-06-09

    Table of Contents:
  • Crawling the Web with Java
  • Fundamentals of a Web Crawler
  • An Overview of the Search Crawler
  • The SearchCrawler Class part 1
  • The SearchCrawler Class part 2
  • SearchCrawler Variables and Constructor
  • The search() Method
  • The showError() and updateStats() Methods
  • The addMatch() and verifyURL() Methods
  • The downloadPage(), removeWwwFromURL(), and
  • An Overview of Regular Expression Processing
  • A Close Look at retrieveLinks()
  • The searchStringMatches() Method
  • The crawl() Method
  • Compiling and Running the Search Web Crawler

  • Rate this Article: Poor Best 
      ADD THIS ARTICLE TO:
      Del.ici.ous Digg
      Blink Simpy
      Google Spurl
      Y! MyWeb Furl
    Email Me Similar Content When Posted
    Add Developer Shed Article Feed To Your Site
    Email Article To Friend
    Print Version Of Article
    PDF Version Of Article
     
     
    ADVERTISEMENT


    Crawling the Web with Java - Fundamentals of a Web Crawler


    (Page 2 of 15 )

    Despite the numerous applications for Web crawlers, at the core they are all fundamentally the same. Following is the process by which Web crawlers work:

    1. Download the Web page.

    2. Parse through the downloaded page and retrieve all the links.

    3. For each link retrieved, repeat the process.

    Now let’s look at each step of the process in more detail.

    In the first step, a Web crawler takes a URL and downloads the page from the Internet at the given URL. Oftentimes the downloaded page is saved to a file on disk or put in a database. Saving the page allows the crawler or other software to go back later and manipulate the page, be it for indexing words (as in the case with a search engine) or for archiving the page for use by an automated archiver.

    In the second step, a Web crawler parses through the downloaded page and retrieves the links to other pages. Each link in the page is defined with an HTML anchor tag similar to the one shown here:

     <A HREF="http://www.host.com/directory/file.html">Link</A>

    After the crawler has retrieved the links from the page, each link is added to a list of links to be crawled.

    The third step of Web crawling repeats the process. All crawlers work in a recursive or loop fashion, but there are two different ways to handle it. Links can be crawled in a depth-first or breadth-first manner. Depth-first crawling follows each possible path to its conclusion before another path is tried. It works by finding the first link on the first page. It then crawls the page associated with that link, finding the first link on the new page, and so on, until the end of the path has been reached. The process continues until all the branches of all the links have been exhausted.

    Breadth-first crawling checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page and then crawls each link on the first page’s first link, and so on, until each level of links has been exhausted. Choosing whether to use depth-or breadth-first crawling often depends on the crawling application and its needs. Search Crawler uses breadth-first crawling, but you can change this behavior if you like.

    Although Web crawling seems quite simple at first glance, there’s actually a lot that goes into creating a full-fledged Web crawling application. For example, Web crawlers need to adhere to the “Robot protocol,” as explained in the following section. Web crawlers also have to handle many “exception” scenarios such as Web server errors, redirects, and so on.

    Adhering to the Robot Protocol

    As you can imagine, crawling a Web site can put an enormous strain on a Web server’s resources as a myriad of requests are made back to back. Typically, a few pages are downloaded at a time from a Web site, not hundreds or thousands in succession. Web sites also often have restricted areas that crawlers should not crawl. To address these concerns, many Web sites adopted the Robot protocol, which establishes guidelines that crawlers should follow. Over time, the protocol has become the unwritten law of the Internet for Web crawlers.

    The Robot protocol specifies that Web sites wishing to restrict certain areas or pages from crawling have a file called robots.txt placed at the root of the Web site. Ethical crawlers will reference the robot file and determine which parts of the site are disallowed for crawling. The disallowed areas will then be skipped by the ethical crawlers. Following is an example robots.txt file and an explanation of its format:

    # robots.txt for http://somehost.com/
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /registration  # Disallow robots on registration page
    Disallow: /login

    The first line of the sample file has a comment on it, as denoted by the use of a hash (#) character. Comments can be on lines unto themselves or on statement lines, as shown on the fifth line of the preceding sample file. Crawlers reading robots.txt files should ignore any comments.

    The third line of the sample file specifies the User-agent to which the Disallow rules following it apply. User-agent is a term used for the programs that access a Web site. For example, when accessing a Web site with Microsoft’s Internet Explorer, the User-agent is “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)” or something similar to it. Each browser has a unique User-agent value that it sends along with each request to a Web server. Web crawlers also typically send a User-agent value along with each request to a Web server. The use of User-agents in the robots.txt file allows Web sites to set rules on a User-agent–by–User-agent basis. However, typically Web sites want to disallow all robots (or User-agents) access to certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that all User-agents are disallowed for the rules that follow it. You might be thinking that the use of an asterisk to disallow all User-agents from accessing a site would prevent standard browser software from working with certain sections of Web sites. This is not a problem, though, because browsers do not observe the Robot protocol and are not expected to.

    The lines following the User-agent line are called disallow statements. The disallow statements define the Web site paths that crawlers are not allowed to access. For example, the first disallow statement in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus, the URLs

     http://somehost.com/cgi-bin/
     http://somehost.com/cgi-bin/register

    are both off limits to crawlers according to that line. Disallow statements are for paths and not specific files; thus any link being requested that contains a path on the disallow list is off limits.

    More Java Articles
    More By McGraw-Hill/Osborne


       · I'm tring the web crawler for big samples (I mean to limit it to 10000 pages) but...
       · The HashSet isn't the only possible cause of out of memory errors. This program is...
       · The biggest problem with this program is that it doesn't transform relative links to...
       · It does transform relative links of the form "foo/bar.html", that is, links which...
       · ok
       · Hi,I d like to report a problem that i have and ask you if you all do have this...
       · Hi,I can't find the source code of this project.Could somebody send it to me to...
     

    Buy this book now. This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388). Check it out at your favorite bookstore. Buy this book now.

    JAVA ARTICLES

    - Deploying Multiple Java Applets as One
    - Deploying Java Applets
    - Understanding Deployment Frameworks
    - Database Programming in Java Using JDBC
    - Extension Interfaces and SAX
    - Entities, Handlers and SAX
    - Advanced SAX
    - Conversions and Java Print Streams
    - Formatters and Java Print Streams
    - Java Print Streams
    - Wildcards, Arrays, and Generics in Java
    - Wildcards and Generic Methods in Java
    - Finishing the Project: Java Web Development ...
    - Generics and Limitations in Java
    - Getting Started with Java Web Development in...






    © 2003-2008 by Developer Shed. All rights reserved. DS Cluster 2 hosted by Hostway
    Stay green...Green IT