
Crawling the Semantic Web, concluded

This article, the second of two parts, examines the problems raised by the glut of information available through the web, and how to tame it. It is excerpted from the book Wicked Cool Java, written by Brian D. Eubanks (No Starch Press, 2005; ISBN: 1593270615).

Author Info:
By: No Starch Press
March 02, 2006
  1. Crawling the Semantic Web, concluded
  2. Guess What? Publishing RSS Newsfeeds with Informa
  3. What’s Up? Aggregating RSS Newsfeeds
  4. Heading to the Polls: Polling RSS Feeds with Informa
  5. All the News Fit to Print: Filtering RSS Feeds with Informa


All the News Fit to Print: Filtering RSS Feeds with Informa

In the previous section, we polled an RSS feed and wrote some code that automatically updates our copy of the Channel object whenever the feed changes. Our PollerObserverIF implementation added the item to a Channel object. You may think that the observer would be a good candidate for doing some filtering of the feed content, such as deciding whether to add new items to our copy. This could work, but since there can be more than one observer connected to a Poller, it’s better to have a separate object do the filtering. By doing this, we won’t need to duplicate any filtering functions, and all the observers can benefit equally from the filtering process.

Informa implements filters through an approval process. You can add one or more approvers to a Poller. The observers will see a new item only if all of the approvers accept it. The approval must be a unanimous vote or the change will remain invisible to the observers (that is, the observers’ newItem method is not called). To add an approver, implement the PollerApproverIF interface and pass it to the Poller’s addApprover method. By making fine-grained approvers, you can use them in a plug-and-play manner. For example, you could have a NoBadWordsApprover that checks for the existence of words that you don’t want to appear on your website or to be added to the Channel. In a similar way, a RelevancyApprover class could check for keywords that are relevant to your intended usage of the feed.
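To make the unanimous-vote rule concrete, here is a minimal, self-contained sketch that does not use Informa itself. The names SketchItem, SketchApprover, and ApprovalSketch are hypothetical stand-ins invented for illustration (Informa's real types are ItemIF and PollerApproverIF), and the word "spam" is only a placeholder for whatever you consider a bad word. It shows how small, single-purpose approvers can be combined so that one veto hides an item.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a feed item; Informa's real type is ItemIF.
class SketchItem {
    final String title;
    final String description;

    SketchItem(String title, String description) {
        this.title = title;
        this.description = description;
    }
}

// Hypothetical stand-in for an approver; Informa's real interface is PollerApproverIF.
interface SketchApprover {
    boolean canAddItem(SketchItem item);
}

class ApprovalSketch {
    // Accepts only items whose title mentions "Java".
    static final SketchApprover RELEVANCY =
        item -> item.title != null && item.title.contains("Java");

    // Rejects items whose description contains the placeholder bad word "spam".
    static final SketchApprover NO_BAD_WORDS =
        item -> item.description == null
            || !item.description.toLowerCase(Locale.ROOT).contains("spam");

    // An item reaches the observers only if every approver accepts it.
    static boolean isApproved(SketchItem item, List<SketchApprover> approvers) {
        for (SketchApprover approver : approvers) {
            if (!approver.canAddItem(item)) {
                return false; // a single veto keeps the item invisible
            }
        }
        return true;
    }
}
```

Because each approver checks exactly one thing, you can add or remove one without touching the others, which is the plug-and-play property described above.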

Approvers check properties within each item, such as the category list and subject, to determine whether an item should be approved. PollerApproverIF has only a single method, canAddItem. The following example implements it by checking the title and the description of each item using regular expressions (as discussed in Chapter 2). Here is the approver class:

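import de.nava.informa.core.ChannelIF;
import de.nava.informa.core.ItemIF;
import de.nava.informa.utils.poller.PollerApproverIF;

public class RelevancyApprover implements PollerApproverIF {
    // Approve only items that mention "Java" in the title or description.
    public boolean canAddItem(ItemIF item, ChannelIF channel) {
        String title = item.getTitle();
        String description = item.getDescription();
        return mentionsJava(title) || mentionsJava(description);
    }

    // Null-safe; (?s) lets '.' match line breaks in multi-line descriptions.
    private static boolean mentionsJava(String text) {
        return text != null && text.matches("(?s).*Java.*");
    }
}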

As you might guess, this approver accepts only items that have “Java” somewhere in the title or description. The next code fragment adds this approver to a Poller. The approver should be added before the observer, and the observer added before registering the channel:

Poller poller = new Poller();
poller.addApprover(new RelevancyApprover());
poller.addObserver(new AnObserver());

There is another class similar to the Poller, the Cleaner, that can periodically remove unwanted items in a channel. It uses a similar process: CleanerObserverIF observers are added to a Cleaner, and CleanerMatcherIF instances decide what should be removed. Perhaps these interfaces should be called “JuryMember” and “Executioner,” because that is a very good metaphor for what they do! You might use the Cleaner to remove items that are older than a few days or meet some other criteria for removal. For both the PollerApproverIF and CleanerMatcherIF decision making, you might want to integrate Lucene text matching, as described in Chapter 3. This would give much more sophisticated text-matching abilities, such as similarity (“fuzzy”) matches.
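The division of labor between matchers and observers can be sketched without Informa as follows. Everything here is a hypothetical stand-in written for illustration: DatedItem, SketchMatcher, SketchCleanerObserver, and CleanerSketch are not Informa types, and the method names are not claimed to match CleanerMatcherIF or CleanerObserverIF. The sketch also assumes that an item counts as unwanted as soon as any one matcher flags it. It shows an age-based matcher (the "jury member") and an observer (the "executioner") that removes whatever was flagged.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a feed item with a publication time.
class DatedItem {
    final String title;
    final Instant published;

    DatedItem(String title, Instant published) {
        this.title = title;
        this.published = published;
    }
}

// Stand-in for the matcher role: decides whether an item is unwanted.
interface SketchMatcher {
    boolean isUnwanted(DatedItem item, Instant now);
}

// Stand-in for the observer role: acts on an item that was judged unwanted.
interface SketchCleanerObserver {
    void unwantedItem(DatedItem item);
}

class CleanerSketch {
    // Builds a matcher that flags items published more than maxAge before "now".
    static SketchMatcher olderThan(Duration maxAge) {
        return (item, now) -> item.published.isBefore(now.minus(maxAge));
    }

    // One cleaning pass: each item flagged by any matcher is reported to every observer.
    static void clean(List<DatedItem> items, List<SketchMatcher> matchers,
                      List<SketchCleanerObserver> observers, Instant now) {
        // Iterate over a copy so observers may remove items from the original list.
        for (DatedItem item : new ArrayList<>(items)) {
            for (SketchMatcher matcher : matchers) {
                if (matcher.isUnwanted(item, now)) {
                    for (SketchCleanerObserver observer : observers) {
                        observer.unwantedItem(item);
                    }
                    break; // one verdict per item is enough
                }
            }
        }
    }
}
```

Keeping the decision (matcher) separate from the action (observer) mirrors the approver and observer split used by the Poller, so the same removal rule can serve several observers.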

Chapter Summary

The techniques of semantic tagging that we’ve described in this chapter are quickly becoming popular in large published data sets, and in the next few years the Semantic Web will see exponential growth. The latest news and website updates, along with what your colleagues are blogging, are already being gathered automatically by RSS aggregators and organized by category. In business-to-business transactions, common high-level ontologies are beginning to connect domains with completely different terminology in ways that were impossible before. For example, within highly specific scientific disciplines, researchers often describe new discoveries in domain-specific terms. This information could lead to breakthroughs in other disciplines, if only it were translated into the appropriate terminology.

Structured newsfeeds are already bringing current news and other information to anyone with an aggregator and a network connection. Using more detailed semantic markup (with SUMO or other high-level ontologies), information could be made even more accessible to everyone—even if the original document uses obscure terminology or a foreign language. We will soon see new types of aggregators and intelligent agents that make logical inferences based on the news and perhaps act on our behalf. Organizations that are properly prepared for this will be able to use the Semantic Web much more effectively. One way to start preparing now is by identifying each type of data with a URI, adding a machine-readable RDF type description (for example, that the item is a person, hardware, software, or some other entity), and using standard ontologies where possible. Jena, Informa, and the ontologies discussed in this chapter are some tools that can help you with this process. In the next chapter, we discuss intelligent software agents and explore some of the scientific and mathematical APIs for Java.
