XML is gaining acceptance today, not because it is a great technology looking for a problem, but because today's problems require its flexibility and simplicity. In this article, Doug talks about how XML can be used to accommodate human-authored content. He also discusses structured and unstructured data as well as tips for designing XML DTDs and more.
XML Unlocks Information - How XML Accommodates Human-Authored Content
While highly structured data is independent of the style used to present it, unstructured data is full of style and format. Contrast plain text (no style) with rich text (full of style).
Text documents meant for human authoring and reading have design needs that only XML can address. Examples of semi-structured documents include catalogs, press releases, news reports, and technical documentation. Even highly structured data becomes semi-structured if it includes comments, descriptions, or instructions meant to be read by people.
XML supports the development of semi-structured documents that contain both relational meta data (the structure) and free-form (unstructured) formatted text. The meta data (that is, the XML tags) meets the programmatic need for structure. Without meta data, a computer program cannot understand the content. Formatted text meets the human and business need to express richly styled content. Without style, the content is dry and unattractive.
The paragraph you are reading now is an example of formatted text. Most document editors display content (unstructured data) as WYSIWYG (what you see is what you get). For a business user to comfortably create semi-structured textual documents, a document editor must allow the author to add style to the text.
Variations of Structured and Unstructured Data
Two kinds of semi-structured data (the middle two items below) lie between highly structured data and unstructured data:
highly structured data
structured data with unstructured elements
unstructured documents with tagged meta data
unstructured data
Structured data with unstructured elements is commonly used in web forms, where most fields are tightly constrained (for example, "State" must be selected from a list and "ZIP" must be all digits), yet a "comment" field is available for human-readable content. A product record with a free-form description follows the same pattern:
<product>
  <name>Deluxe Widget</name>
  <listprice units="usd">$19.95</listprice>
  <radius>6mm</radius>
  <description>
    This <em>deluxe <strong>gold</strong> plated</em> product fits most attachments.
  </description>
</product>
For this kind of document, use a DTD or schema to validate the structure, and include an unstructured element (for example, description) that allows both text and tags. In a DTD, this element would typically be defined as:
<!ELEMENT description ANY>
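As a rough illustration of how a program might consume such a document, the sketch below uses Python's standard xml.etree.ElementTree module to read the structured fields and then flatten the free-form description into plain text. The function name read_product and the dictionary keys are hypothetical, chosen only for this example, and no DTD validation is performed here.

```python
import xml.etree.ElementTree as ET

# The product record from the article, with a mixed-content description.
PRODUCT_XML = """<product>
  <name>Deluxe Widget</name>
  <listprice units="usd">$19.95</listprice>
  <radius>6mm</radius>
  <description>
    This <em>deluxe <strong>gold</strong> plated</em> product fits most attachments.
  </description>
</product>"""


def read_product(xml_text):
    """Hypothetical helper: split a product record into structured
    fields and the human-authored description."""
    root = ET.fromstring(xml_text)
    price = root.find("listprice")
    description = root.find("description")

    # itertext() walks the text inside the element and all of its
    # children, so inline tags such as <em> are skipped over.
    plain = " ".join("".join(description.itertext()).split())

    # Collect the inline formatting tags the author used, in document order.
    inline_tags = [el.tag for el in description.iter() if el is not description]

    return {
        "name": root.findtext("name"),
        "price": price.text,
        "units": price.get("units"),
        "radius": root.findtext("radius"),
        "description_text": plain,
        "description_tags": inline_tags,
    }


if __name__ == "__main__":
    for key, value in read_product(PRODUCT_XML).items():
        print(key, "=", value)
```

The structured fields come back as simple values a program can rely on, while the description keeps its inline markup available for any renderer that wants to show the styling.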
Unstructured documents with tagged meta data are less common but hold the most promise for content that can be searched effectively. HTML provides a few descriptive tags, such as <ADDRESS> and <CODE>, but XML provides the flexibility to create custom tags.
<owner studentid="2456">Jim Smith</owner> owns a
<automobile model="OCC96">Cutlass Ciera</automobile>.
<my:conditional value="birds">
  <my:reference>
    <my:author>Joe Kluck</my:author> in his article
    <my:title type="article">Why Chicken have Wings</my:title>
    <my:bibliography>(<my:source><my:periodical>Poultry Monthly</my:periodical>
    <my:issue>September 2001</my:issue></my:source>, page <my:page>9</my:page>)</my:bibliography>
    dispels the usual stereotypes of flightless birds.
  </my:reference>
</my:conditional>
This kind of document must be well formed to allow processing by an XML parser but is usually not validated against a DTD or schema. For such a document, XHTML is a natural choice because it is well formed, has extensive formatting capability, and allows custom XML tags to be added without causing display problems in browsers. Note that the namespace prefix "my" is used to distinguish the custom XML tags from standard HTML tags.
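To make the well-formedness requirement and the search benefit concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It makes several assumptions that are not in the article: the fragment is shortened and wrapped in a <p> element so that it has a single root and a declaration for the "my" prefix, the namespace URI http://example.com/my is invented for the example, and the helper names is_well_formed and find_tagged are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the article never gives a URI for the "my" prefix,
# so this one is made up for illustration.
MY_NS = "http://example.com/my"

# A shortened version of the article's fragment, wrapped in a <p> root
# that declares the "my" namespace prefix.
FRAGMENT = f"""<p xmlns:my="{MY_NS}">
<owner studentid="2456">Jim Smith</owner> owns a
<automobile model="OCC96">Cutlass Ciera</automobile>.
<my:reference><my:author>Joe Kluck</my:author> in his article
<my:title type="article">Why Chicken have Wings</my:title>
(<my:periodical>Poultry Monthly</my:periodical>, page <my:page>9</my:page>)
dispels the usual stereotypes of flightless birds.
</my:reference>
</p>"""


def is_well_formed(xml_text):
    """Return True if the parser accepts the text. No DTD or schema
    validation is involved, only well-formedness."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def find_tagged(xml_text, name):
    """Return the text of every element with the given name, at any depth.
    Names may use the "my:" prefix, which is mapped to MY_NS."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iterfind(".//" + name, {"my": MY_NS})]


if __name__ == "__main__":
    print(is_well_formed(FRAGMENT))
    print(find_tagged(FRAGMENT, "my:author"))
    print(find_tagged(FRAGMENT, "owner"))
```

Because the custom tags carry meaning, a search for authors or owners becomes a lookup by element name rather than a guess based on the surrounding prose.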