This article, the first of two parts, examines the problems raised by the glut of information available through the web, and how to tame it. It is excerpted from the book Wicked Cool Java, written by Brian D. Eubanks (No Starch Press, 2005; ISBN: 1593270615).
Crawling the Semantic Web - Triple the Fun: Creating an RDF Vocabulary for Your Organization
An RDF graph creates a web of concepts. It makes assertions about logical relationships between entities. RDF was meant to fit into a dynamic knowledge representation system rather than a static database structure. Once you have information in RDF, it can be linked with graphs made elsewhere, and software can use this to make inferences. If you define how your own items are related in terms of higher-level concepts, your data can fit into a much larger web of concepts. This is the basis of the Semantic Web.
Every organization has relationships between information that is held in a data store such as a database or flat file (or human memory!). If your data is in a relational database, your data items probably have relationships between them that are hidden or implied within the database structure itself.
Your data may not be completely accessible, because there are relationships that an application cannot query. As an example, suppose that we have a relational database containing employees and departments within a company. A common approach is to create an Employee table, with columns for employee information such as ID number, date of birth, name, hire date, supervisor name, and department. There are many relationships hidden within the table and column names, and it is up to an application to know these relationships and take advantage of them. Column names alone would not give you the following information:
A and B are employees.
An employee is a person.
A supervisor is an employee who directs another employee.
C is a company.
A company is an organization.
A and B work for C.
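The statements above can be pictured as subject-predicate-object triples, which is the shape an RDF graph uses. The following is a minimal plain-Java sketch rather than Jena code. The class name HiddenFacts, the Triple record, and the predicate strings type, subClassOf, and worksFor are assumptions made for illustration, and the "directs another employee" part of the supervisor definition is left out for brevity.

```java
import java.util.List;

// Hypothetical illustration: HiddenFacts, Triple, and the predicate strings
// are names invented for this sketch, not part of any RDF library.
class HiddenFacts {

    // One assertion: subject, predicate, object.
    record Triple(String subject, String predicate, String object) {}

    // The statements from the list above, written as triples.
    static List<Triple> facts() {
        return List.of(
            new Triple("A", "type", "Employee"),
            new Triple("B", "type", "Employee"),
            new Triple("Employee", "subClassOf", "Person"),
            new Triple("Supervisor", "subClassOf", "Employee"),
            new Triple("C", "type", "Company"),
            new Triple("Company", "subClassOf", "Organization"),
            new Triple("A", "worksFor", "C"),
            new Triple("B", "worksFor", "C"));
    }

    // Returns the subjects of every triple with the given predicate and object.
    static List<String> subjectsOf(String predicate, String object) {
        return facts().stream()
            .filter(t -> t.predicate().equals(predicate) && t.object().equals(object))
            .map(Triple::subject)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(subjectsOf("worksFor", "C"));    // who works for C?
        System.out.println(subjectsOf("type", "Employee")); // who is an employee?
    }
}
```

Once the facts are explicit data rather than knowledge buried in column names, a question such as "who works for C?" becomes a simple filter over the triples.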
Column and table names in a database are simply local identifiers and don't automatically map to any concepts that might be defined elsewhere. But this is domain knowledge that could be used more effectively by the application if it were defined in an extensible and machine-readable way. Having such information available would give our applications more flexibility, and this knowledge could also be reused elsewhere. How can we encode this information so that applications can make use of these relationships? And how can our application relate this to other information that we might find on the Semantic Web?
It may not make sense to put this metadata in your database, but you can create an RDF mapping outside the database schema that describes each item relative to the Semantic Web as a whole. We can represent some of these concepts using existing vocabularies. The rest of them we can define in our own terms. If you don't know where to connect a concept to an existing vocabulary, you can always define a URI for that concept now and make the connection to other systems later. At least you can use it to share data within your own organization if your vocabulary is well documented and the meaning of each item is clear. There are many basic vocabularies that RDF applications can use, and new ones are constantly being created (like yours!). The online resources page for this section has an updated listing of some existing vocabularies that you can use in defining your data.
The first step is to define a URI for each concept that is even remotely related to your application. This is much like the object-oriented development process, but these entities may also be things that are not directly used by the application. By defining your terms within a larger context, you can later map these entities to existing concepts on the Web. Let's try it with our employee example, by first listing some related concepts and their meanings (in English text). Here is a simplistic attempt to define some terms:
http://example.org/wcjava/employee = an employee
http://example.org/wcjava/person = a person
http://example.org/wcjava/organization = an organization
http://example.org/wcjava/employer = an organization that employs an employee
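In code, one possible way to keep these identifiers in a single place is a small constants class. This is only a sketch in plain Java: the class name Vocabulary, the constant names, and the isWellFormed check are assumptions for illustration, while the four URIs are the ones listed above.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: the class, constant, and method names are hypothetical.
// The URIs are the four example.org identifiers from the text.
class Vocabulary {
    static final String NS = "http://example.org/wcjava/";

    static final URI EMPLOYEE = URI.create(NS + "employee");
    static final URI PERSON = URI.create(NS + "person");
    static final URI ORGANIZATION = URI.create(NS + "organization");
    static final URI EMPLOYER = URI.create(NS + "employer");

    static List<URI> terms() {
        return List.of(EMPLOYEE, PERSON, ORGANIZATION, EMPLOYER);
    }

    // True when every term is an absolute URI and no two terms collide.
    static boolean isWellFormed(List<URI> terms) {
        Set<URI> seen = new HashSet<>();
        for (URI term : terms) {
            if (!term.isAbsolute() || !seen.add(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        terms().forEach(System.out::println);
        System.out.println("well formed: " + isWellFormed(terms()));
    }
}
```

Sharing one namespace prefix keeps the terms consistent, and a check like this can catch an accidental duplicate or a relative name before it spreads into your data.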
The important point is to make sure that each concept has a unique identifier. Make sure that the URIs will still be around a few years from now; you are building a complete concept space around these identifiers! If you have control over your domain name, it might be wise to have a policy that forbids anyone placing actual content under URIs beginning with some prefix (such as http://yourdomain/uri). We are using these names as globally unique identifiers, not as URLs for retrieving documents. There is nothing wrong with a document being there, but it could lead to confusion between the concept and the document. In this example, we are using the example.org domain, which is reserved solely for illustrative purposes within documentation. If you want to define a permanent URI, there are sites that will let you define your own permanent URI independent of future domain name ownership changes. (For more information on this, see this book's companion website.) The best known of these is http://purl.org.
After you have identified some concept URIs, it's time to define relationships between them. In the previous section, we showed how to do this in Jena using our own relationships. Now let's use some predefined relationships created by others and apply them to our entities. Adding another entity that was defined elsewhere is easy: just add its URI to the graph we are building. But if we want to do anything useful with such an entity, we will also need to import the statements that define its related properties and resources. In our example, we will use the subClassOf property defined in RDF Schema, which works similarly to a subclass relationship in object-oriented programming. The graph in Figure 4-3 shows the relationships between our resources.
Figure 4-3: Using the subClassOf property from RDF Schema
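To illustrate what the subClassOf property lets software infer, here is a rough plain-Java sketch with no RDF library involved. It assumes two links that follow from the definitions given earlier (employee is a subclass of person, employer is a subclass of organization); the figure itself may show more. The supervisor URI is hypothetical and appears only to demonstrate a two-step chain. The sketch also allows just one superclass per class, which is a simplification, since RDF Schema permits several.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch of subClassOf reasoning in plain Java.
// The "supervisor" URI is hypothetical; it is not one of the four terms
// defined in the text and is added only to show a two-step chain.
class SubClassSketch {
    static final String NS = "http://example.org/wcjava/";

    // Direct subClassOf links: key is the subclass, value is its superclass.
    static final Map<String, String> SUB_CLASS_OF = Map.of(
        NS + "employee", NS + "person",
        NS + "employer", NS + "organization",
        NS + "supervisor", NS + "employee");

    // Follows subClassOf links upward and collects every ancestor class.
    static Set<String> superClassesOf(String cls) {
        Set<String> result = new LinkedHashSet<>();
        String current = SUB_CLASS_OF.get(cls);
        // add() returns false on a repeat, which stops the loop on a cycle.
        while (current != null && result.add(current)) {
            current = SUB_CLASS_OF.get(current);
        }
        return result;
    }

    // True when ancestor is reachable from cls through subClassOf links.
    static boolean isSubClassOf(String cls, String ancestor) {
        return superClassesOf(cls).contains(ancestor);
    }

    public static void main(String[] args) {
        System.out.println(superClassesOf(NS + "supervisor"));
        System.out.println(isSubClassOf(NS + "employer", NS + "person"));
    }
}
```

Nothing in the data states directly that a supervisor is a person; that conclusion comes from following the chain of links, which is the kind of inference the property makes possible.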
At first, you should do this mapping with pen and paper (archaic, but always accessible) or using an RDF visualization tool. This book's website has a list of some free tools that can be used for this purpose. When you have finished, you will have a graph of the relationships between entities in your system. Once you've created a hierarchy and vocabulary, you can create N3 or RDF/XML files that you can use as metadata. Most RDF visualization tools will do this for you automatically. You'll want to familiarize yourself with some of the existing RDF vocabularies on which you can base your own hierarchy. Our resources page has links to some of these and examples of using them. Once you have designed a hierarchy, you can create and manipulate it from Jena. The next section shows how to do this.
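Before turning to Jena, it may help to see how small such a metadata file can be. The sketch below writes statements by hand as N-Triples, a simple line-based format in the same family as N3. It is an illustration under stated assumptions, not how a visualization tool or Jena would do it: the class NTriplesSketch and its methods are hypothetical, the two subClassOf statements are inferred from the earlier definitions, and no escaping is performed, so it suits only simple URIs like the ones used here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: writes statements as N-Triples lines by hand, without an RDF library.
// Class and method names are hypothetical. No escaping is done.
class NTriplesSketch {
    static final String NS = "http://example.org/wcjava/";
    static final String SUB_CLASS_OF =
        "http://www.w3.org/2000/01/rdf-schema#subClassOf";

    record Statement(String subject, String predicate, String object) {
        // One N-Triples line: three URIs in angle brackets, then " ."
        String toNTriple() {
            return "<" + subject + "> <" + predicate + "> <" + object + "> .";
        }
    }

    // Joins all statements into one document, one statement per line.
    static String serialize(List<Statement> statements) {
        return statements.stream()
            .map(s -> s.toNTriple() + "\n")
            .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        List<Statement> graph = List.of(
            new Statement(NS + "employee", SUB_CLASS_OF, NS + "person"),
            new Statement(NS + "employer", SUB_CLASS_OF, NS + "organization"));
        System.out.print(serialize(graph));
    }
}
```

In practice an RDF library or tool would handle escaping, literals, and prefixes, so hand-written output like this is best treated as a way to understand the format rather than as a replacement for those tools.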