
Crawling the Web with Java


Are you exploring the possibilities of Java? This article explains in detail how to build a Web crawler in Java, examining the SearchCrawler class and its methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
TABLE OF CONTENTS:
  1. Crawling the Web with Java
  2. Fundamentals of a Web Crawler
  3. An Overview of the Search Crawler
  4. The SearchCrawler Class part 1
  5. The SearchCrawler Class part 2
  6. SearchCrawler Variables and Constructor
  7. The search() Method
  8. The showError() and updateStats() Methods
  9. The addMatch() and verifyURL() Methods
  10. The downloadPage(), removeWwwFromURL(), and
  11. An Overview of Regular Expression Processing
  12. A Close Look at retrieveLinks()
  13. The searchStringMatches() Method
  14. The crawl() Method
  15. Compiling and Running the Search Web Crawler


Crawling the Web with Java - Fundamentals of a Web Crawler

Despite the numerous applications for Web crawlers, at their core they all work in fundamentally the same way. Following is the process by which Web crawlers work (a minimal sketch of this loop in Java follows the list):

  1. Download the Web page.

  2. Parse through the downloaded page and retrieve all the links.

  3. For each link retrieved, repeat the process.
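
Taken together, these three steps form a simple loop. The following is a minimal sketch of that loop in Java. The class name and the downloadPage() and retrieveLinks() placeholders are illustrative stand-ins here, not the book's actual code; working versions of those methods are developed later in this chapter.

 import java.util.ArrayDeque;
 import java.util.Collections;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Queue;
 import java.util.Set;

 // Minimal sketch of the three-step crawl loop.
 public class CrawlSketch {
     public static void crawl(String startUrl, int maxPages) {
         Queue<String> toCrawl = new ArrayDeque<>(); // links waiting to be crawled
         Set<String> crawled = new HashSet<>();      // links already visited
         toCrawl.add(startUrl);

         while (!toCrawl.isEmpty() && crawled.size() < maxPages) {
             String url = toCrawl.remove();
             if (!crawled.add(url)) {
                 continue;                           // skip links seen before
             }
             String page = downloadPage(url);        // step 1: download the page
             if (page == null) {
                 continue;                           // page could not be retrieved
             }
             for (String link : retrieveLinks(page)) { // step 2: parse out the links
                 if (!crawled.contains(link)) {
                     toCrawl.add(link);              // step 3: repeat for each link
                 }
             }
         }
     }

     // Placeholders; working versions are developed later in the chapter.
     private static String downloadPage(String url) { return null; }
     private static List<String> retrieveLinks(String page) {
         return Collections.emptyList();
     }
 }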

Now let’s look at each step of the process in more detail.

In the first step, a Web crawler takes a URL and downloads the page at that URL from the Internet. Oftentimes the downloaded page is saved to a file on disk or put in a database. Saving the page allows the crawler or other software to go back later and manipulate the page, be it for indexing words (as is the case with a search engine) or for archiving the page for use by an automated archiver.
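
As a concrete illustration of this first step, the following sketch downloads a page with java.net.HttpURLConnection and returns its contents as a string. It is a simplified stand-in rather than the book's downloadPage() method; error handling and character-set detection are kept to a minimum, and the User-agent value is made up for the example.

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.HttpURLConnection;
 import java.net.URL;

 public class PageDownloader {
     // Download the page at the given URL and return its contents,
     // or null if the page could not be retrieved.
     public static String downloadPage(String pageUrl) {
         try {
             HttpURLConnection conn =
                 (HttpURLConnection) new URL(pageUrl).openConnection();
             conn.setRequestProperty("User-Agent", "ExampleCrawler/1.0"); // illustrative value
             if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                 return null;  // server error, unfollowed redirect, and so on
             }
             StringBuilder page = new StringBuilder();
             try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
                 String line;
                 while ((line = reader.readLine()) != null) {
                     page.append(line).append('\n');
                 }
             }
             return page.toString();
         } catch (Exception e) {
             return null;  // treat any failure as "page not available"
         }
     }
 }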

In the second step, a Web crawler parses through the downloaded page and retrieves the links to other pages. Each link in the page is defined with an HTML anchor tag similar to the one shown here:

 <A HREF="http://www.host.com/directory/file.html">Link</A>

After the crawler has retrieved the links from the page, each link is added to a list of links to be crawled.
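
One simple way to pull these anchor tags out of the downloaded HTML is with the java.util.regex package, which this chapter returns to in the section on regular expression processing. The sketch below is only an approximation: the pattern handles the common double-quoted href form and ignores the many other shapes an anchor tag can take.

 import java.util.ArrayList;
 import java.util.List;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 public class LinkExtractor {
     // Find the href values of anchor tags in a page of HTML.
     // The pattern is deliberately simple and does not cover every
     // legal form of the <A> tag.
     private static final Pattern LINK_PATTERN = Pattern.compile(
         "<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

     public static List<String> retrieveLinks(String page) {
         List<String> links = new ArrayList<>();
         Matcher matcher = LINK_PATTERN.matcher(page);
         while (matcher.find()) {
             links.add(matcher.group(1)); // the captured href value
         }
         return links;
     }
 }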

The third step of Web crawling repeats the process. All crawlers work in a recursive or looping fashion, but there are two different ways to handle it. Links can be crawled in a depth-first or breadth-first manner. Depth-first crawling follows each possible path to its conclusion before another path is tried. It works by finding the first link on the first page. It then crawls the page associated with that link, finding the first link on the new page, and so on, until the end of the path has been reached. The process continues until all the branches of all the links have been exhausted.

Breadth-first crawling checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page and then crawls each link on the first page’s first link, and so on, until each level of links has been exhausted. Choosing whether to use depth- or breadth-first crawling often depends on the crawling application and its needs. Search Crawler uses breadth-first crawling, but you can change this behavior if you like.
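
The difference between the two strategies comes down to where newly found links are placed in the list of links waiting to be crawled. Using a java.util.Deque (an illustration, not the book's code), the main loop always takes the next URL from the front with removeFirst(); adding new links to the back gives breadth-first order, while adding them to the front gives depth-first order.

 import java.util.Deque;

 // Crawl order is decided by where new links are added to the deque of
 // links waiting to be crawled; the main loop always calls removeFirst().
 public class CrawlOrder {
     public static void addLinks(Deque<String> toCrawl,
                                 Iterable<String> newLinks,
                                 boolean breadthFirst) {
         for (String link : newLinks) {
             if (breadthFirst) {
                 toCrawl.addLast(link);   // queue behavior: oldest links crawled first
             } else {
                 toCrawl.addFirst(link);  // stack behavior: newest links crawled first
             }
         }
     }
 }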

Although Web crawling seems quite simple at first glance, there’s actually a lot that goes into creating a full-fledged Web crawling application. For example, Web crawlers need to adhere to the “Robot protocol,” as explained in the following section. Web crawlers also have to handle many “exception” scenarios such as Web server errors, redirects, and so on.

Adhering to the Robot Protocol

As you can imagine, crawling a Web site can put an enormous strain on a Web server’s resources as a myriad of requests are made back to back. Typically, a few pages are downloaded at a time from a Web site, not hundreds or thousands in succession. Web sites also often have restricted areas that crawlers should not crawl. To address these concerns, many Web sites adopted the Robot protocol, which establishes guidelines that crawlers should follow. Over time, the protocol has become the unwritten law of the Internet for Web crawlers.

The Robot protocol specifies that Web sites wishing to restrict certain areas or pages from crawling have a file called robots.txt placed at the root of the Web site. Ethical crawlers will reference the robot file and determine which parts of the site are disallowed for crawling. The disallowed areas will then be skipped by the ethical crawlers. Following is an example robots.txt file and an explanation of its format:

# robots.txt for http://somehost.com/

User-agent: *
Disallow: /cgi-bin/
Disallow: /registration  # Disallow robots on registration page
Disallow: /login

The first line of the sample file has a comment on it, as denoted by the use of a hash (#) character. Comments can be on lines unto themselves or on statement lines, as shown on the fifth line of the preceding sample file. Crawlers reading robots.txt files should ignore any comments.

The third line of the sample file specifies the User-agent to which the Disallow rules following it apply. User-agent is a term used for the programs that access a Web site. For example, when accessing a Web site with Microsoft’s Internet Explorer, the User-agent is “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)” or something similar to it. Each browser has a unique User-agent value that it sends along with each request to a Web server. Web crawlers also typically send a User-agent value along with each request to a Web server. The use of User-agents in the robots.txt file allows Web sites to set rules on a User-agent–by–User-agent basis. However, typically Web sites want to disallow all robots (or User-agents) access to certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that all User-agents are disallowed for the rules that follow it. You might be thinking that the use of an asterisk to disallow all User-agents from accessing a site would prevent standard browser software from working with certain sections of Web sites. This is not a problem, though, because browsers do not observe the Robot protocol and are not expected to.
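
A crawler needs only a small amount of code to read these rules. The sketch below is not the book's implementation; it fetches a site's robots.txt, strips comments, and collects the Disallow paths that appear under "User-agent: *". Records aimed at specific crawlers are ignored for simplicity, and a missing file is treated as "nothing disallowed".

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.net.URL;
 import java.util.ArrayList;
 import java.util.List;

 public class RobotsTxtParser {
     // Fetch and parse a site's robots.txt, returning the Disallow paths
     // that apply to all User-agents ("*").
     public static List<String> getDisallowList(String host) {
         List<String> disallowList = new ArrayList<>();
         try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                 new URL("http://" + host + "/robots.txt").openStream()))) {
             boolean appliesToAll = false;
             String line;
             while ((line = reader.readLine()) != null) {
                 int comment = line.indexOf('#');
                 if (comment >= 0) {
                     line = line.substring(0, comment); // strip comments
                 }
                 line = line.trim();
                 if (line.isEmpty()) {
                     continue;
                 }
                 if (line.toLowerCase().startsWith("user-agent:")) {
                     appliesToAll = line.substring(11).trim().equals("*");
                 } else if (appliesToAll && line.toLowerCase().startsWith("disallow:")) {
                     String path = line.substring(9).trim();
                     if (!path.isEmpty()) {
                         disallowList.add(path);
                     }
                 }
             }
         } catch (Exception e) {
             // No robots.txt, or it could not be read; assume nothing is disallowed.
         }
         return disallowList;
     }
 }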

The lines following the User-agent line are called disallow statements. The disallow statements define the Web site paths that crawlers are not allowed to access. For example, the first disallow statement in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus, the URLs

 http://somehost.com/cgi-bin/
 http://somehost.com/cgi-bin/register

are both off limits to crawlers according to that line. Disallow statements specify path prefixes, not specific files; thus any requested link whose path begins with a path on the disallow list is off limits.
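
A sketch of this prefix check might look like the following (again, an illustration rather than the book's code). Given the Disallow paths already parsed out of a host's robots.txt, it tests whether a URL's path falls under any of them.

 import java.net.URL;
 import java.util.List;

 public class RobotRules {
     // Return true if the URL may be crawled, given the Disallow path
     // prefixes already parsed out of the host's robots.txt.
     public static boolean isAllowed(URL url, List<String> disallowPaths) {
         String file = url.getFile();      // path plus query, e.g. "/cgi-bin/register"
         for (String disallow : disallowPaths) {
             if (file.startsWith(disallow)) {
                 return false;             // falls under a disallowed prefix
             }
         }
         return true;
     }
 }

With the sample robots.txt shown earlier, this check would reject both http://somehost.com/cgi-bin/ and http://somehost.com/cgi-bin/register but allow, say, http://somehost.com/index.html. A production crawler would also cache the parsed rules per host so that robots.txt is not fetched again for every URL.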

