This article, the second of two parts, examines the problems raised by the glut of information available through the web, and how to tame it. It is excerpted from the book Wicked Cool Java, written by Brian D. Eubanks (No Starch Press, 2005; ISBN: 1593270615).
Crawling the Semantic Web, concluded - Guess What? Publishing RSS Newsfeeds with Informa
RDF Site Summary (RSS) is a standard for summarizing content on a web server. An RSS feed is stored in an XML file, and it might include items such as recent news, changes to a website, or new blog entries. A client program called an aggregator collects RSS feeds from multiple web servers and displays them in summary form, sorted by category. The user then chooses to view the full content of any summaries that are of interest. The summary has metadata, such as its subject, encoded along with a text summary. Over time I expect that document metadata will have much more than the Dublin Core and other terms that RSS currently uses. In theory, you could plug into other ontologies such as SUMO, and the meaning of an entire article could be encoded using RDF. This is possible only if you are using an ontology that is expressive enough. This is certainly a lot of effort, but the long-term advantage is that machines would have access to the fully encoded semantics of the text. This probably won’t happen for a while, but adding metadata such as RSS descriptions is a good start in that direction and has an immediate benefit of giving us more accurate categorization of content.
There are several standards named RSS, all of them XML-based and used for similar purposes. Unfortunately, the different standards not only have different XML structures but even expand the RSS acronym differently. Most aggregators can read all RSS flavors, though. The version we discuss here, RDF Site Summary 1.0, uses RDF and is most closely related to the semantic work we’ve done so far in this chapter. If you cannot use it, encoding metadata in one of the other flavors is still better than encoding none at all. There are ways to map between the semantics of each standard, although not all of them are equally expressive. One common practice is to use XSLT stylesheets to transform between the different forms of RSS.
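The stylesheet approach can be sketched with the XSLT support built into the JDK (javax.xml.transform). This is a minimal illustration, not a complete converter: the sample input, the stylesheet, and the class name RssFlavorTransform are all assumptions made for this example, and the stylesheet maps only the channel's title, link, and description into an RSS 1.0-style structure. A valid RSS 1.0 feed would also need an items sequence and item elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RssFlavorTransform {

    // Hypothetical, deliberately tiny RSS 0.91-style input (not a complete feed).
    static final String SAMPLE_RSS_091 =
        "<rss version=\"0.91\"><channel>"
        + "<title>Latest Bug Fixes</title>"
        + "<link>http://example.org/wcj/bugs.rss</link>"
        + "<description>The latest news on our bug fixes</description>"
        + "</channel></rss>";

    // Illustrative stylesheet: copies three channel fields into RSS 1.0 elements.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
        + " xmlns=\"http://purl.org/rss/1.0/\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/rss/channel\">"
        + "<rdf:RDF><channel rdf:about=\"{link}\">"
        + "<title><xsl:value-of select=\"title\"/></title>"
        + "<link><xsl:value-of select=\"link\"/></link>"
        + "<description><xsl:value-of select=\"description\"/></description>"
        + "</channel></rdf:RDF>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result as a string.
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform(SAMPLE_RSS_091, STYLESHEET));
    }
}
```

Because the flavors are not equally expressive, a real stylesheet has to decide what to do with fields that have no counterpart in the target format; this sketch simply ignores them.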
Because RSS 1.0 is built on RDF and XML, there are several ways of creating feeds: a DOM parser, an RDF API, or an RSS-specific API. DOM is lower-level than is necessary for creating RDF. Jena has RSS support through its RSS class, which has static objects that represent RSS properties you can use in building an RSS-compatible RDF graph. But if you’re going to be working a lot with RSS, you’ll want to use an RSS-specific API that can understand the different RSS versions that are commonly used.
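To give a sense of why DOM feels low-level for this job, here is a rough sketch that builds only the channel header of an RSS 1.0 document with the JDK's DOM classes. The class name DomRssSketch and the helper appendText are hypothetical, and the result is not a complete feed (it has no items). Every namespace, element, and attribute has to be handled by hand.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomRssSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Builds <rdf:RDF><channel rdf:about="...">...</channel></rdf:RDF> and serializes it.
    static String buildFeed(String feedUrl, String title, String description) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        // Root element, with both namespace declarations written out explicitly.
        Element rdf = doc.createElementNS(RDF_NS, "rdf:RDF");
        rdf.setAttributeNS(XMLNS_NS, "xmlns:rdf", RDF_NS);
        rdf.setAttributeNS(XMLNS_NS, "xmlns", RSS_NS);
        doc.appendChild(rdf);

        // The channel is identified by an rdf:about attribute.
        Element channel = doc.createElementNS(RSS_NS, "channel");
        channel.setAttributeNS(RDF_NS, "rdf:about", feedUrl);
        rdf.appendChild(channel);

        appendText(doc, channel, "title", title);
        appendText(doc, channel, "link", feedUrl);
        appendText(doc, channel, "description", description);

        // An identity transform is the standard JDK way to serialize a DOM tree.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Appends a child element in the RSS namespace containing the given text.
    private static void appendText(Document doc, Element parent, String name, String text) {
        Element child = doc.createElementNS(RSS_NS, name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildFeed("http://example.org/wcj/bugs.rss",
            "Latest Bug Fixes", "The latest news on our bug fixes"));
    }
}
```

Roughly forty lines produce three fields of a single channel, which is the kind of bookkeeping an RDF or RSS-specific API takes off your hands.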
Informa is an open-source API for reading and writing RSS in Java. One of its most powerful features is the ability to persist the feed metadata in a database. Informa can also read data from external feeds (as described in a later section), perform text-filtering tasks, and update RSS content on a periodic schedule. Let’s use it to create a feed using the basic in-memory builder—the ChannelBuilder class from the de.nava.informa.impl.basic package. In RSS terminology, a channel is another name for metadata about some content (such as a website) and is the main entity in a newsfeed. Each RSS file defines a channel and items belonging to the channel. Rather than work with the XML directly, which can be somewhat tedious, we’ll use a ChannelBuilder to create the RSS file.
import java.net.URL;

import de.nava.informa.core.ChannelExporterIF;
import de.nava.informa.core.ChannelIF;
import de.nava.informa.core.ItemIF;
import de.nava.informa.exporters.RSS_1_0_Exporter;
import de.nava.informa.impl.basic.ChannelBuilder;

// Create the channel with the in-memory builder
ChannelBuilder builder = new ChannelBuilder();
ChannelIF myChannel = builder.createChannel("Latest Bug Fixes");

// This is the URL for which we are describing the metadata
URL channelURL = new URL("http://example.org/wcj/bugs.rss");
myChannel.setLocation(channelURL);
myChannel.setDescription("The latest news on our bug fixes");

// We create a first item
String title = "Annoying Bug #25443 Now Fixed";
String desc = "A major bug in OurGreatApplication is fixed. "
    + "Bug #25443, which has been annoying users ever since 3.0, "
    + "was due to a rogue null pointer.";
URL url = new URL("http://example.org/wcj/bugfix25443.html");
ItemIF anItem = builder.createItem(myChannel, title, desc, url);
anItem.setCreator("Ecks Amples");

// We create a second item, reusing the same variables
title = "Bug #12121 not Fixed in 7.1";
desc = "Bug #12121 will not be fixed in OurGreatApplication "
    + "release 7.1, so that developers can focus on adding "
    + "the WickedCool feature.";
url = new URL("http://example.org/wcj/bugfix12121.html");
anItem = builder.createItem(myChannel, title, desc, url);
anItem.setCreator("Dee Veloper");

// export the document to disk, in RSS 1.0 format
ChannelExporterIF exporter = new RSS_1_0_Exporter("bugs.rss");
exporter.write(myChannel);
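You can place the XML-encoded RSS feed anywhere on your site. The main page of your site should include a link to the feed. For automated discovery by RSS crawlers such as Syndic8, you can do this with a link tag in the page’s head section. A typical autodiscovery tag, here pointing at the example feed, looks like this:
<link rel="alternate" type="application/rss+xml" title="Latest Bug Fixes" href="http://example.org/wcj/bugs.rss" />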
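The crawler's side of this discovery step can be sketched in a few lines of plain Java. The class FeedAutodiscovery and its methods are hypothetical, written only for illustration, and the sketch assumes the common convention of rel="alternate" with type="application/rss+xml". It scans with regular expressions and handles only double-quoted attribute values, so a real crawler would more likely use a proper HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeedAutodiscovery {

    // Matches a whole <link ...> tag; its attributes are inspected separately.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the double-quoted value of the named attribute, if present in the tag.
    private static Optional<String> attribute(String tag, String name) {
        Matcher m = Pattern
            .compile("\\b" + name + "\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE)
            .matcher(tag);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns the href of the first link tag that advertises an RSS feed.
    static Optional<String> findFeedUrl(String html) {
        Matcher m = LINK_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            boolean alternate = attribute(tag, "rel")
                .map(v -> v.equalsIgnoreCase("alternate")).orElse(false);
            boolean rss = attribute(tag, "type")
                .map(v -> v.equalsIgnoreCase("application/rss+xml")).orElse(false);
            if (alternate && rss) {
                return attribute(tag, "href");
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<link rel=\"stylesheet\" type=\"text/css\" href=\"site.css\">"
            + "<link rel=\"alternate\" type=\"application/rss+xml\""
            + " title=\"Latest Bug Fixes\" href=\"http://example.org/wcj/bugs.rss\">"
            + "</head><body></body></html>";
        System.out.println(findFeedUrl(page).orElse("no feed advertised"));
    }
}
```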
You’ll also want a hypertext link for human visitors, so they can add your site to their aggregator. If you are going to be creating large feeds that change often, or working with many feeds simultaneously, use the Hibernate-based version of the builder, which will persist the RSS metadata in a database. Hibernate is an API for mapping Java objects to relational database structures and automatically translating data between them. See the Informa documentation, and this section’s resource page, for more information. In the next section, we’ll see how to read newsfeeds with Informa.