Crawling the Web with Java
By: McGraw-Hill/Osborne
    2005-06-09

    Table of Contents:
  • Crawling the Web with Java
  • Fundamentals of a Web Crawler
  • An Overview of the Search Crawler
  • The SearchCrawler Class part 1
  • The SearchCrawler Class part 2
  • SearchCrawler Variables and Constructor
  • The search() Method
  • The showError() and updateStats() Methods
  • The addMatch() and verifyURL() Methods
  • The downloadPage(), removeWwwFromURL(), and
  • An Overview of Regular Expression Processing
  • A Close Look at retrieveLinks()
  • The searchStringMatches() Method
  • The crawl() Method
  • Compiling and Running the Search Web Crawler



    Crawling the Web with Java - A Close Look at retrieveLinks()



    The retrieveLinks( ) method uses the regular expression API to obtain the links from a page. It begins with these lines of code:

    // Compile link matching pattern.
    Pattern p =
      Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]", 
        Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(pageContents);

    The regular expression used to obtain links can be broken down as a series of steps, as shown in the following table:

    Character Sequence    Explanation
    ------------------    -----------
    <a                    Look for the characters "<a".
    \\s+                  Look for one or more whitespace characters.
    href                  Look for the characters "href".
    \\s*                  Look for zero or more whitespace characters.
    =                     Look for the character "=".
    \\s*                  Look for zero or more whitespace characters.
    \"?                   Look for zero or one quote character.
    (.*?)                 Look for zero or more of any character until the next part of the pattern is matched, and place the results in a group.
    [\"|>]                Look for a quote character or a greater-than (">") character. Because | is a literal character inside a character class, a pipe character also ends the match.

    Notice that Pattern.CASE_INSENSITIVE is passed to the pattern compiler. As mentioned, this indicates that the pattern should ignore case when searching for matches.
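    To make the breakdown above concrete, here is a minimal sketch that runs the same pattern against a few hand-written anchor tags. The class name LinkPatternDemo, the helper firstLink(), and the sample HTML strings are assumptions made for illustration; they are not part of the Search Crawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkPatternDemo {
    // Same pattern and flag as in the article.
    static final Pattern LINK = Pattern.compile(
        "<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first captured href value, or null.
    static String firstLink(String html) {
        Matcher m = LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Quoted attribute value.
        System.out.println(firstLink("<a href=\"page.html\">x</a>"));
        // Upper case tag, spaces around '=', unquoted value.
        System.out.println(firstLink("<A HREF = index.html>x</A>"));
        // href is not the first attribute, so the pattern does not match.
        System.out.println(firstLink("<a class=\"n\" href=\"p.html\">x</a>"));
        // A literal '|' ends the capture early.
        System.out.println(firstLink("<a href=a|b.html>x</a>"));
    }
}
```

    The last two calls show limits of this simple pattern: it expects href to follow "<a" directly, and a pipe character inside an unquoted link cuts the captured value short.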

    Next, a list to hold the links is created, and the search for the links begins, as shown here:

    // Create list of link matches.
    ArrayList linkList = new ArrayList();
    while (m.find()) {
      String link = m.group(1).trim();

    Each link is found by cycling through m with a while loop. The find( ) method of Matcher returns true until no more matches are found. Each match (link) found is retrieved by calling the group( ) method defined by Matcher. Notice that group( ) takes 1 as an argument. This specifies that the first group from the matching sequences be returned. Notice also that trim( ) is called on the return value from the group( ) method. This removes any unnecessary leading or trailing space from the value.
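    The find( )/group( )/trim( ) loop just described can be tried on its own. The sketch below is a stand-alone illustration, not the crawler's method: the class FindLoopDemo, the method collectLinks(), and the sample HTML are hypothetical, and it uses a typed List<String> where the article uses a raw ArrayList.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Collects every group-1 match, trimmed, in document order.
    static List<String> collectLinks(String pageContents) {
        Pattern p = Pattern.compile(
            "<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageContents);
        List<String> links = new ArrayList<>();
        // find() advances to the next match and returns false when none remain.
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\" a.html \">A</a> <a href=\"b.html\">B</a>";
        System.out.println(collectLinks(html)); // [a.html, b.html]
    }
}
```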

    Many of the links found in Web pages are not suited for crawling. The following code filters out several links that the Search Crawler is uninterested in:

    // Skip empty links.
    if (link.length() < 1) {
      continue;
    }
    // Skip links that are just page anchors.
    if (link.charAt(0) == '#') {
      continue;
    }
    // Skip mailto links.
    if (link.indexOf("mailto:") != -1) {
      continue;
    }
    // Skip JavaScript links.
    if (link.toLowerCase().indexOf("javascript") != -1) {
      continue;
    }

    First, empty links are skipped so as not to waste any more time on them. Second, links that are simply anchors into a page are skipped by checking to see if the first character of the link is a hash (#).

    Page anchors allow for links to be made to a certain section of a page. Take, for example, this URL:

      http://osborne.com/#contact

    This URL has an anchor to the “contact” section of the page located at http://osborne.com. Links inside the page at http://osborne.com can reference the section relatively as just “#contact”. Since anchors are not links to “new” pages, they are skipped over.

    Next, “mailto” links are skipped. Mailto links are used for specifying an e-mail link in a Web page. For example, the link

      mailto:books@osborne.com

    is a mailto link. Since mailto links don’t point to Web pages and cannot be crawled, they are skipped over. Finally, JavaScript links are skipped. JavaScript is a scripting language that can be embedded in Web pages to add interactive functionality, and that functionality can be invoked from links. Like mailto links, JavaScript links cannot be crawled, so they are skipped as well.
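    As a rough illustration, the four checks can be gathered into a single predicate and tried against sample links. The class LinkFilterDemo and the method isCrawlable() are hypothetical names; in the Search Crawler the checks sit inline in the while loop and use continue.

```java
class LinkFilterDemo {
    // Returns false for any link the article's filters would skip.
    static boolean isCrawlable(String link) {
        if (link.length() < 1) return false;                 // empty link
        if (link.charAt(0) == '#') return false;             // page anchor only
        if (link.indexOf("mailto:") != -1) return false;     // mailto link
        if (link.toLowerCase().indexOf("javascript") != -1)  // JavaScript link
            return false;
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "", "#contact", "mailto:books@osborne.com",
            "JavaScript:void(0)", "http://osborne.com/#contact",
            "/docs/javascript-guide.html"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isCrawlable(s));
        }
    }
}
```

    Note that the JavaScript check looks for the word anywhere in the link, so an ordinary page whose address happens to contain "javascript" (the last sample) is skipped too.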

    As you’ve just seen, the links in Web pages can take many formats, such as mailto and JavaScript formats. Additionally, traditional links inside Web pages can take a few different formats as well. Following are the three formats that traditional links can take:

    The first of the three links shown here is considered to be a fully qualified URL. The second example is a shortened version of the first URL, omitting the “host” portion of the URL. Notice the slash (/) at the beginning of the URL. The slash indicates that the URL is what’s called “absolute.” Absolute URLs are URLs that start at the root of a Web site. The third example is again a shortened version of the first URL, omitting the “host” portion of the URL. Notice that this third example does not have the leading slash. Since the leading slash is absent, the URL is considered to be “relative.” Relative, in the realm of URLs, means that the URL address is relative to the URL on which the link is found.
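    The three forms can be told apart with the same two tests the crawler relies on: the presence of "://" and a leading slash. The sketch below is only an illustration; the class LinkFormsDemo, the method classify(), and the example.com links are made up for this purpose and do not come from the article.

```java
class LinkFormsDemo {
    // Classifies a non-empty link using the article's terminology.
    static String classify(String link) {
        if (link.indexOf("://") != -1) return "fully qualified";
        if (link.charAt(0) == '/') return "absolute";
        return "relative";
    }

    public static void main(String[] args) {
        // Hypothetical links, one per form.
        System.out.println(classify("http://example.com/books/list.html"));
        System.out.println(classify("/books/list.html"));
        System.out.println(classify("books/list.html"));
    }
}
```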

    The lines of code in the next section handle converting absolute and relative links into fully qualified URLs:

    // Prefix absolute and relative URLs if necessary.
    if (link.indexOf("://") == -1) {
      // Handle absolute URLs.
      if (link.charAt(0) == '/') {
        link = "http://" + pageUrl.getHost() + link;
      // Handle relative URLs.
      } else {
        String file = pageUrl.getFile();
        if (file.indexOf('/') == -1) {
          link = "http://" + pageUrl.getHost() + "/" + link;
        } else {
          String path =
            file.substring(0, file.lastIndexOf('/') + 1);
          link = "http://" + pageUrl.getHost() + path + link;
        }
      }
    }

    First, the link is checked to see whether or not it is fully qualified by looking for the presence of "://" in the link. If these characters exist, the URL is assumed to be fully qualified. However, if they are not present, the link is converted to a fully qualified URL. As discussed, links beginning with a slash (/) are absolute, so this code adds "http://" and the current page’s URL host to the link to fully qualify it. Relative links are converted here in a similar fashion.

    For relative links, the current page URL’s filename is taken and checked to see if it contains a slash (/). A slash in the filename indicates that the file is in a directory hierarchy. For example, a file may look like this:

              dir1/dir2/file.html
    or simply like this:
              file.html

    In the latter case, "http://", the current page’s URL host, and "/" are added to the link since the current page is at the root of the Web site. In the former case, the “path” (or directory) portion of the filename is retrieved to create the fully qualified URL. This case concatenates "http://", the current page’s URL host, the path, and the link together to create a fully qualified URL.
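    The conversion can be exercised in isolation with a small sketch that follows the same branches. The class QualifyLinkDemo, the method qualify(), and the example.com URLs are assumptions for illustration. One detail worth checking when you run it: java.net.URL.getFile() returns the path with its leading slash (for example "/dir1/dir2/file.html"), and returns an empty string when the URL has no path at all.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class QualifyLinkDemo {
    // Follows the article's branches for building a fully qualified URL.
    static String qualify(URL pageUrl, String link) {
        if (link.indexOf("://") != -1) {
            return link;                                   // already qualified
        }
        if (link.charAt(0) == '/') {
            return "http://" + pageUrl.getHost() + link;   // absolute
        }
        String file = pageUrl.getFile();                   // relative
        if (file.indexOf('/') == -1) {
            return "http://" + pageUrl.getHost() + "/" + link;
        }
        String path = file.substring(0, file.lastIndexOf('/') + 1);
        return "http://" + pageUrl.getHost() + path + link;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL deep = URI.create("http://example.com/dir1/dir2/file.html").toURL();
        URL bare = URI.create("http://example.com").toURL();

        System.out.println(deep.getFile());               // /dir1/dir2/file.html
        System.out.println(qualify(deep, "other.html"));
        System.out.println(qualify(deep, "/top.html"));
        System.out.println(qualify(bare, "a.html"));
    }
}
```

    Because getFile() also includes any query string, a page URL whose query contains a slash would yield an unexpected path with this approach. The approach also always emits "http://" and drops any port number.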

    Next, page anchors and "www" are removed from the fully qualified link:

    // Remove anchors from link.
    int index = link.indexOf('#');
    if (index != -1) {
      link = link.substring(0, index);
    }
    // Remove leading "www" from URL's host if present.
    link = removeWwwFromUrl(link);

    For the same reason that anchor-only links are skipped over, anchors tacked on to the end of a link are stripped off: the link with and without the anchor refers to the same page. The leading "www" is also removed from links so that duplicate links can be detected and skipped later in this method.
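    The effect of these two steps is that different spellings of the same page collapse to one string. The sketch below illustrates this under an assumption: stripWww() is a hypothetical stand-in that removes a "www." immediately following "://", and it may differ from the removeWwwFromUrl( ) method defined elsewhere in the article.

```java
class NormalizeLinkDemo {
    // Hypothetical stand-in for the article's removeWwwFromUrl().
    static String stripWww(String url) {
        int index = url.indexOf("://www.");
        if (index != -1) {
            // Keep everything through "://", then skip the four chars "www.".
            return url.substring(0, index + 3) + url.substring(index + 7);
        }
        return url;
    }

    // Strips a trailing anchor, then the leading "www".
    static String normalize(String link) {
        int index = link.indexOf('#');
        if (index != -1) {
            link = link.substring(0, index);
        }
        return stripWww(link);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.osborne.com/#contact"));
        System.out.println(normalize("http://osborne.com/"));
    }
}
```

    Both calls print http://osborne.com/, which is what lets a later duplicate check treat them as the same link.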

    Next, the link is verified to make sure it is a valid URL:

    // Verify link and skip if invalid.
    URL verifiedLink = verifyUrl(link);
    if (verifiedLink == null) {
      continue;
    }

    After validating that the link is a URL, the following code checks to see if the link’s host is the same as the one specified by Start URL and checks to see if the link has already been crawled:

    /* If specified, limit links to those
       having the same host as the start URL. */
    if (limitHost &&
        !pageUrl.getHost().toLowerCase().equals(
          verifiedLink.getHost().toLowerCase()))
    {
      continue;
    }
    // Skip link if it has already been crawled.
    if (crawledList.contains(link)) {
      continue;
    }
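    These two checks can be sketched as one stand-alone method. The names HostFilterDemo and shouldSkip() are hypothetical, the crawled collection is assumed here to be a Set<String>, and equalsIgnoreCase() is used as shorthand for the article's paired toLowerCase( ) calls.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

class HostFilterDemo {
    // Returns true when the link should be skipped.
    static boolean shouldSkip(URL pageUrl, URL verifiedLink, String link,
                              boolean limitHost, Set<String> crawled) {
        // Skip links on a different host when host limiting is on.
        if (limitHost
                && !pageUrl.getHost().equalsIgnoreCase(verifiedLink.getHost())) {
            return true;
        }
        // Skip links that have already been crawled.
        return crawled.contains(link);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL page = URI.create("http://example.com/a.html").toURL();
        Set<String> crawled = new HashSet<>();
        crawled.add("http://example.com/seen.html");

        String[] links = {
            "http://example.com/new.html",
            "http://example.com/seen.html",
            "http://other.org/index.html"
        };
        for (String link : links) {
            URL url = URI.create(link).toURL();
            System.out.println(link + " skip="
                + shouldSkip(page, url, link, true, crawled));
        }
    }
}
```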

    Finally, the retrieveLinks( ) method ends by adding each link that passes all filters to the link list.

      // Add link to list.
      linkList.add(link);
    }
    return (linkList);

    After the while loop finishes and all links have been added to the link list, the link list is returned.


    This article was taken from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0596007388).
