Do you cringe when you hear the word "XML"? Are you just not sure what the heck this acronym is all about? Not to worry! In this article, Liviu introduces us to XML: how it came to be, and the various ways it can be created and parsed.
DTDs, or Document Type Definitions, are a means of establishing an XML “grammar”. They rely on a specialized syntax for describing the structure of a set of XML documents. (This is actually one of the XML community’s biggest complaints about the DTD: it doesn’t make much sense to learn another syntax for describing the contents of an XML document when XML itself can be used for this, as we will find in the next section about schemas.) Although the DTD structure is compact, the way it actually describes the XML contents is rather cryptic, and it has its limitations and drawbacks. Still, it’s not hard to learn the DTD syntax and create a DTD for simple XML documents (as we will see shortly, for documents that contain complex data structures, XML Schemas are recommended). Let’s now try to describe the contents of an XML document containing data about a company’s employees; we assume each employee’s data will be stored in a separate XML document, and an employee should have (at least) the following properties:
Date of birth
If we take the approach of using attributes to describe an employee, then our DTD will look like this:
This defines a structure that has three mandatory attributes (marked with “#REQUIRED”) and three optional ones – we have to give our employees the right to keep their home number secret in order to avoid being called into the office late at night ;). All attributes are declared as “character data” (CDATA) – therefore we will expect strings in these fields.
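As an illustration, a DTD fitting this description might look like the one embedded in the Python sketch below. It is hypothetical: the attribute names are borrowed from the sample document later in the article, phone is a guess for the sixth attribute, and the split between required and optional attributes is an assumption. The DTD is wrapped in a DOCTYPE only so that the standard library's expat parser can read the declarations back.

```python
import xml.parsers.expat

# Hypothetical DTD: attribute names and the required/optional split are assumptions.
EMPLOYEE_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Employee [
  <!ELEMENT Employee EMPTY>
  <!ATTLIST Employee
    name    CDATA #REQUIRED
    surname CDATA #REQUIRED
    dob     CDATA #REQUIRED
    email   CDATA #IMPLIED
    address CDATA #IMPLIED
    phone   CDATA #IMPLIED>
]>
<Employee name="Liviu" surname="Tudor" dob="14/02/1975"/>
"""


def declared_attributes(xml_text):
    """Return {attribute name: (type, is_required)} for every ATTLIST entry."""
    declared = {}

    def on_attlist(element, attribute, attr_type, default, required):
        # expat reports one call per attribute declared in an ATTLIST.
        declared[attribute] = (attr_type, bool(required))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_text, True)
    return declared


attrs = declared_attributes(EMPLOYEE_DOC)
required = sorted(name for name, (_, req) in attrs.items() if req)
optional = sorted(name for name, (_, req) in attrs.items() if not req)
print("required:", required)
print("optional:", optional)
```

In DTD syntax, #IMPLIED is the keyword that marks an attribute as optional with no default value.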
A short explanation is needed here: there are two types of “character data” in XML: parsed and unparsed. Unparsed character data lives in a CDATA section (<![CDATA[ ... ]]>) inside element content; in that case, the data arrives exactly as it is stored in the XML document. For example, a CDATA section containing “CDATA&amp;NOT CDATA” yields the string “CDATA&amp;NOT CDATA”. Parsed character data – or PCDATA – on the other hand allows us to include escape sequences; these escape sequences are parsed and translated into the corresponding characters, which are passed back to the application. For those of you who have done some HTML coding, the following sequences will make sense: &amp; &quot; etc.; such sequences correspond to & (ampersand), " (quote) and so on – that’s exactly what happens with PCDATA. For the above-mentioned string, “CDATA&amp;NOT CDATA”, the final string passed to the application will be “CDATA&NOT CDATA”. Note that the CDATA keyword in an attribute declaration simply means “string”: escape sequences inside attribute values are still expanded by the parser, just as in PCDATA.
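The difference can be checked with a short Python sketch using the standard library's ElementTree. The element names (notes, parsed, raw) and the label attribute are made up for the demonstration. The same text is placed in ordinary element content, in a CDATA section, and in an attribute value, which the parser treats like parsed data even when a DTD declares the attribute as CDATA.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented for this demo.
doc = ET.fromstring(
    '<notes label="CDATA&amp;NOT CDATA">'
    "<parsed>CDATA&amp;NOT CDATA</parsed>"
    "<raw><![CDATA[CDATA&amp;NOT CDATA]]></raw>"
    "</notes>"
)

parsed_text = doc.find("parsed").text  # entity reference is expanded
raw_text = doc.find("raw").text        # CDATA section is passed through untouched
label = doc.get("label")               # attribute values are expanded as well

print(parsed_text)  # CDATA&NOT CDATA
print(raw_text)     # CDATA&amp;NOT CDATA
print(label)        # CDATA&NOT CDATA
```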
Let’s have a look at what an XML document conforming to the above DTD would look like:
However, in neither of these examples have we specified which DTD this XML is conforming to! So, how would an application know where to look for the DTD? Well, it doesn’t! We have to specify in the actual XML document where the DTD can be found. To do this, we have two options:
Embedding the actual DTD in the XML body
Creating an external DTD and referencing it in our XML document
The first option is easy and is suitable for small DTDs, in the case of XML documents that are rarely generated and used. The XML above will be transformed as follows:
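As an illustration of the embedded form, the Python sketch below parses a document whose DTD sits inside the DOCTYPE declaration, between square brackets. The DTD itself is hypothetical (the attribute names and the required/optional split are assumptions), and minidom is used only to show that the parser sees the DTD text and the element in a single document.

```python
from xml.dom import minidom

# Hypothetical embedded DTD; names and the required/optional split are assumptions.
EMBEDDED = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Employee [
  <!ELEMENT Employee EMPTY>
  <!ATTLIST Employee
    name    CDATA #REQUIRED
    surname CDATA #REQUIRED
    dob     CDATA #REQUIRED
    email   CDATA #IMPLIED
    address CDATA #IMPLIED
    phone   CDATA #IMPLIED>
]>
<Employee name="Liviu" surname="Tudor" dob="14/02/1975" address="Coocooland"/>
"""

document = minidom.parseString(EMBEDDED)
doctype = document.doctype

print(doctype.name)            # root element named by the DOCTYPE
print(doctype.internalSubset)  # the DTD text found between [ and ]
print(document.documentElement.getAttribute("surname"))
```

Note that minidom is a non-validating parser: it reads the embedded DTD but does not check the element against it.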
The second one is just as easy, and it is the better choice when the DTD is substantial in size or complexity, when it is known that more than one document using this DTD will eventually need to be parsed (so the parsers can cache it), or both! (Of course, these are just a few of the considerations to keep in mind when deciding whether to isolate the DTD from the XML document or embed it within. There are cases where practice dictates otherwise; however, in my experience, they work as a general rule of thumb.)
XML Document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Employee SYSTEM "employee.dtd">
<Employee name="Liviu" surname="Tudor" dob="14/02/1975" email="firstname.lastname@example.org" address="Coocooland"/>
As you may have noticed in the above example, we have placed a reference to the DTD file name. The good news is that this doesn’t have to be a path to the file system – it can be a URL to a well-known location on the Internet (or intranet) to a DTD that has been, for example, defined by an international body and to which your document has to adhere.
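To see what a parser makes of such a reference, here is a minimal Python sketch using the standard library's expat bindings. The helper name dtd_reference and the example.org URL are hypothetical. By default expat only reports the system identifier and does not fetch the DTD, so the sketch runs without any file or network access.

```python
import xml.parsers.expat


def dtd_reference(xml_text):
    """Return (root name, system identifier, has internal subset) from the DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["info"] = (name, system_id, bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found.get("info")  # None when the document has no DOCTYPE


local = '<!DOCTYPE Employee SYSTEM "employee.dtd"><Employee name="Liviu"/>'
# Hypothetical URL, used only to show that the identifier may point to a remote DTD.
remote = (
    '<!DOCTYPE Employee SYSTEM "http://example.org/dtd/employee.dtd">'
    '<Employee name="Liviu"/>'
)

print(dtd_reference(local))
print(dtd_reference(remote))
```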
In fact, once the XML/DTD paradigm came out there were a lot of companies from the same field teaming up to construct DTDs for different industries – in order to allow integration between different computer systems for that specific industry; these DTDs, once established, were (and are still) published in a “well-known” location on the web and, therefore, your document only had to reference the URL to the DTD, rather than distributing a DTD with your XML file all the time.
Now, knowing how to build up a DTD and how to reference it in our XML document doesn’t automatically make our document perfect. An XML document can be well formed (it can respect the XML rules about constructing tags and so on), but it might not be valid (that is, it might not adhere to the rules laid out in the DTD). In order for an XML document to be “process-able” by another application, it has to be both well formed and valid.
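In standard XML terminology, a well-formed document obeys XML's syntax rules, and a valid document additionally conforms to its DTD. The Python sketch below shows the first half of that distinction. The DTD is hypothetical, and the standard library parser used here is non-validating: it can reject a document that is not well formed, but it accepts a well-formed document that breaks the DTD. Checking validity needs a validating parser, which Python's standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text obeys XML syntax rules; the DTD rules are not enforced."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical DTD that requires a name attribute on Employee.
DTD = (
    "<!DOCTYPE Employee ["
    "<!ELEMENT Employee EMPTY>"
    "<!ATTLIST Employee name CDATA #REQUIRED>"
    "]>"
)

broken = DTD + '<Employee name="Liviu">'  # tag never closed: not well formed
invalid = DTD + "<Employee/>"             # well formed, but the required name is missing
good = DTD + '<Employee name="Liviu"/>'   # well formed and valid

for label, text in [("broken", broken), ("invalid", invalid), ("good", good)]:
    print(label, is_well_formed(text))
```

The second document passes this check even though it violates the DTD, which is exactly why a separate validation step is needed.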