Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how to build a Web crawler in Java, covering the SearchCrawler class and its methods. It is excerpted from Chapter 6 of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
  1. Crawling the Web with Java
  2. Fundamentals of a Web Crawler
  3. An Overview of the Search Crawler
  4. The SearchCrawler Class part 1
  5. The SearchCrawler Class part 2
  6. SearchCrawler Variables and Constructor
  7. The search() Method
  8. The showError() and updateStats() Methods
  9. The addMatch() and verifyURL() Methods
  10. The downloadPage(), removeWwwFromURL(), and
  11. An Overview of Regular Expression Processing
  12. A Close Look at retrieveLinks()
  13. The searchStringMatches() Method
  14. The crawl() Method
  15. Compiling and Running the Search Web Crawler

Have you ever wondered how Internet search engines like Google and Yahoo! can search the Internet on virtually any topic and return a list of results so quickly? Obviously, it would be impossible to scour the Internet each time a search request was initiated. Instead, search engines query highly optimized databases of Web pages that have been aggregated and indexed ahead of time. Compiling these databases in advance allows search engines to cover billions of Web pages for something as esoteric as “astrophysics” or as common as “weather” and return the results almost instantly.

The real mystery of search engines does not lie in their databases of Web pages, but rather in how those databases are created. Search engines use software known as Web crawlers to traverse the Internet and save each page they encounter along the way. Search engines then use additional software to index each of the saved pages, creating a database of all the words in those pages.
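To make the indexing step concrete, here is a minimal sketch of a word index that maps each word to the pages containing it. It is only an illustration: the class name TinyIndex and its methods are hypothetical, it assumes Java 8 or later, and a real search engine index is far more elaborate.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not part of the Search Crawler developed in this chapter.
final class TinyIndex {
    // Maps each lowercase word to the set of page URLs that contain it.
    private final Map<String, Set<String>> wordToPages = new TreeMap<>();

    // Splits the page text on anything that is not a letter or digit
    // and records the page URL under every word found.
    void addPage(String url, String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String word : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                wordToPages.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
            }
        }
    }

    // Returns the URLs of all pages containing the word, or an empty set.
    Set<String> lookup(String word) {
        Set<String> pages = wordToPages.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? Collections.<String>emptySet() : pages;
    }
}
```

Because the index is built ahead of time, answering a query is a single map lookup rather than a scan of every saved page.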

Web crawlers are an essential component of search engines; however, their use is not limited to creating databases of Web pages. In fact, Web crawlers have many practical uses. For example, you might use a crawler to look for broken links in a commercial Web site. You might also use a crawler to find changes to a Web site. To do so, first crawl the site, creating a record of the links it contains. At a later date, crawl the site again and compare the two sets of links, looking for changes. A crawler could also be used to archive the contents of a site. In short, crawler technology is useful in many types of Web-related applications.
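The change-detection idea described above comes down to comparing two sets of links. The following is a minimal sketch, assuming each crawl has already produced a set of URL strings; the class name LinkDiff and its methods are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of comparing the links recorded by two crawls.
final class LinkDiff {
    // Links found by the later crawl that the earlier crawl did not record.
    static Set<String> added(Set<String> earlier, Set<String> later) {
        Set<String> result = new TreeSet<>(later); // copy so the input is untouched
        result.removeAll(earlier);
        return result;
    }

    // Links recorded by the earlier crawl that the later crawl no longer found.
    static Set<String> removed(Set<String> earlier, Set<String> later) {
        return added(later, earlier);
    }
}
```

If both results are empty, the site's link structure did not change between the two crawls.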

Although Web crawlers are conceptually simple, in that you just follow the links from one site to another, they are a bit challenging to create. One complication is that a list of links to be crawled must be maintained, and this list grows and shrinks as sites are searched. Another complication is the handling of absolute versus relative links. Fortunately, Java contains features that make it easier to implement a Web crawler. First, Java’s support for networking makes downloading Web pages simple. Second, Java’s support for regular expression processing simplifies the finding of links. Third, Java’s Collections Framework supplies the mechanisms needed to store a list of links.
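The three Java features just mentioned can be seen working together in the rough sketch below. It is not the chapter's SearchCrawler: the class name CrawlSketch, its method names, the simplified link pattern, and the UTF-8 assumption are all illustrative, and it assumes Java 8 or later. Relative links are resolved against the page's own address with java.net.URI, and a queue plus a "seen" set form the list of links that grows and shrinks during the crawl.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the names here are not taken from the chapter's SearchCrawler.
final class CrawlSketch {
    // Simplified pattern for <a ... href="..."> tags with double-quoted values only.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Networking: reads a page into a string (assumes UTF-8); returns null on failure.
    static String download(String url) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }

    // Regular expressions: finds links and turns relative ones into absolute URLs.
    static List<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            if (href.isEmpty() || href.startsWith("#")) {
                continue; // skip empty links and same-page anchors
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // skip links that are not valid URIs
            }
        }
        return links;
    }

    // Collections: a FIFO queue of pages to visit plus a set of every URL already queued.
    // The fetcher maps a URL to its HTML, or null if the page is unavailable.
    static List<String> crawl(String startUrl, Function<String, String> fetcher, int maxPages) {
        Deque<String> toCrawl = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        toCrawl.add(startUrl);
        seen.add(startUrl);
        while (!toCrawl.isEmpty() && visited.size() < maxPages) {
            String url = toCrawl.remove();
            visited.add(url);
            String html = fetcher.apply(url);
            if (html == null) {
                continue;
            }
            for (String link : extractLinks(url, html)) {
                if (seen.add(link)) { // add() returns false for URLs already queued
                    toCrawl.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetcher in as a parameter keeps the sketch usable without a network connection; a call such as CrawlSketch.crawl(startUrl, CrawlSketch::download, 10) would use the real download step and stop after at most ten pages.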

The Web crawler developed in this chapter is called Search Crawler. It crawls the Web, looking for sites that contain strings matching those specified by the user. It displays the URLs of the sites in which matches are found. Although Search Crawler is a useful utility as is, its greatest benefit is found when it is used as a starting point for your own crawler-based applications.
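As a rough illustration of the matching step, the sketch below checks whether a page contains every term in a search string. It assumes that terms are separated by whitespace and that all of them must appear; the class name TermMatcher and its method are hypothetical, and the chapter's own searchStringMatches() method, covered later, may work differently.

```java
import java.util.Locale;

// Hypothetical illustration of matching user-specified strings against page text.
final class TermMatcher {
    // Returns true if every whitespace-separated term in searchString occurs in pageText.
    static boolean containsAllTerms(String pageText, String searchString, boolean caseSensitive) {
        String haystack = caseSensitive ? pageText : pageText.toLowerCase(Locale.ROOT);
        String terms = caseSensitive ? searchString : searchString.toLowerCase(Locale.ROOT);
        for (String term : terms.trim().split("\\s+")) {
            if (!term.isEmpty() && !haystack.contains(term)) {
                return false; // one missing term is enough to reject the page
            }
        }
        return true;
    }
}
```

A crawler built this way would report a page's URL only when containsAllTerms returns true for that page's downloaded contents.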
