Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how to use Java's Web Crawler class and methods. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

Author Info:
By: McGraw-Hill/Osborne
June 09, 2005
  1. · Crawling the Web with Java
  2. · Fundamentals of a Web Crawler
  3. · An Overview of the Search Crawler
  4. · The SearchCrawler Class part 1
  5. · The SearchCrawler Class part 2
  6. · SearchCrawler Variables and Constructor
  7. · The search() Method
  8. · The showError() and updateStats() Methods
  9. · The addMatch() and verifyURL() Methods
  10. · The downloadPage(), removeWwwFromURL(), and
  11. · An Overview of Regular Expression Processing
  12. · A Close Look at retrieveLinks()
  13. · The searchStringMatches() Method
  14. · The crawl() Method
  15. · Compiling and Running the Search Web Crawler

Crawling the Web with Java - The searchStringMatches() Method
(Page 13 of 15 )

The searchStringMatches( ) method, shown here, is used to search through the contents of a Web page downloaded during crawling, determining whether or not the specified search string is present in the page:

  /* Determine whether or not search string is
     present in the given page contents. */
  private boolean searchStringMatches(
    String pageContents, String searchString,
    boolean caseSensitive)
  {
    String searchContents = pageContents;

    /* If the search is case insensitive, lowercase
       page contents before comparison. */
    if (!caseSensitive) {
      searchContents = pageContents.toLowerCase();
    }

    // Split search string into individual terms.
    Pattern p = Pattern.compile("[\\s]+");
    String[] terms = p.split(searchString);

    // Check to see if each term matches.
    for (int i = 0; i < terms.length; i++) {
      if (caseSensitive) {
        if (searchContents.indexOf(terms[i]) == -1) {
          return false;
        }
      } else {
        if (searchContents.indexOf(terms[i].toLowerCase()) == -1) {
          return false;
        }
      }
    }

    return true;
  }

Because the search can be either case insensitive (the default) or case sensitive, searchStringMatches( ) starts by declaring a local variable, searchContents, that refers to the string to be searched. Initially, searchContents is assigned the pageContents string unchanged. If the search is case insensitive, however, searchContents is set to a lowercased version of the pageContents string.
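The lowercasing step can be tried in isolation. The following is a minimal sketch, not part of SearchCrawler: the class name CaseFoldDemo and the helper contains( ) are hypothetical. It also passes Locale.ROOT to toLowerCase( ), which the original method does not do; the no-argument form uses the default locale, and in some locales (Turkish, for example) that changes how the letter I is lowercased.

```java
import java.util.Locale;

public class CaseFoldDemo {
  // Hypothetical helper: reports whether term occurs in text,
  // lowercasing both sides first when the search is case insensitive.
  static boolean contains(String text, String term, boolean caseSensitive) {
    if (!caseSensitive) {
      // Locale.ROOT is an assumption added here for locale-independent results.
      text = text.toLowerCase(Locale.ROOT);
      term = term.toLowerCase(Locale.ROOT);
    }
    return text.indexOf(term) != -1;
  }

  public static void main(String[] args) {
    String title = "Crawling the Web with Java";
    System.out.println(contains(title, "java", false)); // prints true
    System.out.println(contains(title, "java", true));  // prints false
  }
}
```

Lowercasing both the text and the term is what makes the indexOf( ) comparison ignore case; lowercasing only one side would cause terms containing uppercase letters to be missed.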

Next, the search string is split into individual search terms using Java’s regular expression library. To split the search string, first, a regular expression pattern is compiled with the Pattern class’s static compile( ) method. The pattern used here, "[\\s]+", states that one or more white space characters (such as spaces, tabs, or newlines) should be matched. Second, the compiled Pattern’s split( ) method is invoked with the search string, which yields a String array containing individual search terms.
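The splitting step described above can be sketched on its own. The class name SplitTermsDemo and the helper splitTerms( ) are hypothetical and exist only for this illustration. The second call shows an edge case worth knowing: when the search string begins with white space, split( ) returns an empty first term, and calling trim( ) beforehand avoids it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitTermsDemo {
  // Hypothetical helper: splits a search string on runs of white space,
  // using the same pattern as searchStringMatches( ).
  static String[] splitTerms(String searchString) {
    Pattern p = Pattern.compile("[\\s]+");
    return p.split(searchString);
  }

  public static void main(String[] args) {
    // Multiple spaces and a tab each count as a single separator.
    System.out.println(Arrays.toString(splitTerms("java  web\tcrawler")));
    // prints [java, web, crawler]

    // Leading white space produces an empty first term.
    System.out.println(Arrays.toString(splitTerms(" java web")));
    // prints [, java, web]

    // Trimming first removes the empty term.
    System.out.println(Arrays.toString(splitTerms(" java web".trim())));
    // prints [java, web]
  }
}
```

An empty term is harmless to the matching logic, since indexOf( ) with an empty string returns 0 rather than -1, but it is useful to be aware of when reusing the split elsewhere.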

After breaking the search string up, the individual terms are cycled through, checking to see if each term is found in the page’s contents. The indexOf( ) method defined by String is used to search through the searchContents variable. A return value of -1 indicates that the search term was not found, and thus false is returned since all terms must be found in order to have a match. Notice that if the search is case insensitive, the search term is lowercased in the comparison. This coincides with the value assigned to the searchContents variable at the beginning of this method. If the for loop executes in its entirety, the searchStringMatches( ) method concludes by returning true, indicating that all terms in the search string matched.
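To see the whole method in action outside the crawler, the logic can be copied into a small standalone class. This is a sketch under stated assumptions: the class name SearchStringDemo and the sample page string are made up for the demonstration, and the method is declared static (rather than private, as in SearchCrawler) so that main( ) can call it directly.

```java
import java.util.regex.Pattern;

public class SearchStringDemo {
  // Static copy of the searchStringMatches( ) logic for experimentation.
  static boolean searchStringMatches(
    String pageContents, String searchString,
    boolean caseSensitive)
  {
    String searchContents = pageContents;
    if (!caseSensitive) {
      searchContents = pageContents.toLowerCase();
    }

    // Split search string into individual terms.
    String[] terms = Pattern.compile("[\\s]+").split(searchString);

    // Every term must be present; one miss rejects the page.
    for (int i = 0; i < terms.length; i++) {
      String term = caseSensitive ? terms[i] : terms[i].toLowerCase();
      if (searchContents.indexOf(term) == -1) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Made-up page contents for the demonstration.
    String page = "<html><body>Crawling the Web with Java</body></html>";

    System.out.println(searchStringMatches(page, "java web", false));    // prints true
    System.out.println(searchStringMatches(page, "java web", true));     // prints false
    System.out.println(searchStringMatches(page, "Java Web", true));     // prints true
    System.out.println(searchStringMatches(page, "java python", false)); // prints false
  }
}
```

Two properties follow from the use of indexOf( ): the terms may appear in the page in any order, and each term is matched as a substring, so a term such as "web" would also match inside a longer word.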
