Exploring JDBC and XML - Using XML
One of Java's main selling points is that the language produces programs that can run on different operating systems without modification. The portability of software is a big convenience in today's computing environment, where Windows, Linux, Mac OS, and a half-dozen other operating systems are in wide use and many people work with multiple systems.
XML, which stands for Extensible Markup Language, is a format for storing and organizing data that is independent of any software program that works with the data.
Data that is compliant with XML is easier to reuse for several reasons.
First, the data is structured in a standard way, making it possible for software programs to read and write the data as long as they support XML. If you create an XML file that represents your company's employee database, there are several dozen XML parsers that can read the file and make sense of its contents.
This is true no matter what kind of information you collect about each employee. If your database contains only the employee's name, ID number, and current salary, XML parsers can read it. If it contains 25 items, including birthday, blood type, and hair color, parsers can read that, too.
Second, the data is self-documenting, making it easier for people to understand the purpose of a file just by looking at it in a text editor. Anyone who opens your XML employee database should be able to figure out the structure and content of each employee record without any assistance from you.
This is evident in Listing 20.4, which contains an XML file.
Listing 20.4 The Full Text of collection.librml
1: <?xml version="1.0"?>
2: <!DOCTYPE Library SYSTEM "librml.dtd">
3: <Library>
4: <Book>
5: <Author>Joseph Heller</Author>
6: <Title>Catch-22</Title>
7: <PubDate edition="Trade"
isbn="0684833395">09/1996</PubDate>
8: <Publisher>Simon and Schuster</Publisher>
9: <Subject>Fiction</Subject>
10: <Review>heller-catch22.html</Review>
11: </Book>
12: <Book>
13: <Author>Kurt Vonnegut</Author>
14: <Title>Slaughterhouse-Five</Title>
15: <PubDate edition="Paperback"
isbn="0440180295">12/1991</PubDate>
16: <Publisher>Dell</Publisher>
17: <Subject>Fiction</Subject>
18: </Book>
19: </Library>
Enter this text using a word processor or text editor and save it as plain text under the name collection.librml. (You can also download a copy of it from the book's Web site at http://www.java21days.com on the Day 20 page.)
Can you tell what the data represents? Although the ?xml and !DOCTYPE tags at the top may be indecipherable, the rest is clearly a book database of some kind.
The ?xml tag in the first line of the file, formally called the XML declaration, has an attribute called version with a value of 1.0. XML files should begin with a declaration like this. XML 1.0 parsers tolerate its absence, but including it is the recommended practice.
Data in XML is surrounded by tag elements that describe the data. Start tags begin with a < character followed by the name of the tag and a > character. End tags begin with the </ characters followed by a name and a > character. In Listing 20.4, for example, <Book> on line 12 is a start tag, and </Book> on line 18 is an end tag. Everything within those tags is considered to be the value of that element.
Tags can be nested within other tags, creating a hierarchy of XML data that establishes relationships within that data. In Listing 20.4, everything in lines 13–17 is related; each tag defines something about the same book.
XML also supports tag elements defined by a single tag rather than a pair of tags. These tags begin with a < character followed by the name of the tag and the /> characters. For example, the book database could include an <outOfPrint/> tag that indicates a book isn't presently available for sale.
Tag elements also can include attributes, which are made up of data that supplements the rest of the data associated with the tag. Attributes are defined within a start tag element. The name of an attribute is followed by an equal sign and text within quotation marks. In Line 7 of Listing 20.4, the PubDate tag includes two attributes: edition, which has a value of "Trade", and isbn, which has a value of "0684833395".
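To see how a program gets at elements and attributes like these, here is a minimal sketch that uses the DOM parser bundled with the JDK (the javax.xml.parsers package). The class name BookLister and its methods are hypothetical names chosen for illustration, and the XML is a shortened copy of Listing 20.4 with the !DOCTYPE line left out so that no DTD file is required.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the book's code.
class BookLister {
    // Shortened copy of Listing 20.4 (no DOCTYPE, fewer child elements).
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<Library>"
        + "<Book><Author>Joseph Heller</Author><Title>Catch-22</Title>"
        + "<PubDate edition=\"Trade\" isbn=\"0684833395\">09/1996</PubDate></Book>"
        + "<Book><Author>Kurt Vonnegut</Author><Title>Slaughterhouse-Five</Title>"
        + "<PubDate edition=\"Paperback\" isbn=\"0440180295\">12/1991</PubDate></Book>"
        + "</Library>";

    // Returns one summary line per Book element found in the XML text.
    static List<String> describeBooks(String xml) {
        List<String> lines = new ArrayList<>();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList books = doc.getElementsByTagName("Book");
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                // Nested elements hold the book's data as text.
                String title = textOf(book, "Title");
                String author = textOf(book, "Author");
                // The isbn attribute lives on the PubDate start tag.
                Element pubDate =
                    (Element) book.getElementsByTagName("PubDate").item(0);
                String isbn = (pubDate == null) ? "" : pubDate.getAttribute("isbn");
                lines.add(title + " by " + author + " (ISBN " + isbn + ")");
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return lines;
    }

    // Text of the first child element with the given name, or "" if absent.
    static String textOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return (matches.getLength() == 0) ? "" : matches.item(0).getTextContent();
    }

    public static void main(String[] args) {
        describeBooks(SAMPLE).forEach(System.out::println);
    }
}
```

Running the class should print one line for each of the two books. Nothing in the code knows about books in advance except the tag and attribute names it asks for, which is the point of the format: the structure travels with the data.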
XML encourages the creation of data that's understandable and usable even if the user doesn't have the program that created it and cannot find any documentation that describes it.
Data that follows XML's formatting rules is said to be well-formed. Any software that can work with XML reads and writes well-formed XML data.
By insisting on well-formed markup, XML simplifies the task of writing programs that work with the data.
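A short sketch can make the idea of well-formed data concrete. It assumes the JDK's built-in parser and uses a hypothetical class named WellFormedCheck. The parser refuses to continue when it meets markup that breaks XML's rules, so a program can treat a successful parse as proof that the data is well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
class WellFormedCheck {
    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports broken markup as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        // Matching start and end tags.
        System.out.println(isWellFormed("<Book><Title>Catch-22</Title></Book>"));
        // The Title element is never closed.
        System.out.println(isWellFormed("<Book><Title>Catch-22</Book>"));
    }
}
```

The first call should report true and the second false, because the Title start tag has no matching end tag. A Web browser would shrug off the same mistake in HTML; an XML parser does not.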
One of the motivations behind the development of XML in 1996 was the inconsistency of HTML. HTML is a wildly popular way to organize data for presentation to users, but Web browsers have always been designed to allow for inconsistent use of HTML tags. Web page designers can break numerous rules of valid HTML as it's defined by the World Wide Web Consortium, and their work still loads normally into a browser such as Mozilla or Internet Explorer. Millions of people are putting content on the Web without paying heed to valid HTML at all. They test their content to make sure that it's viewable in Web browsers, but they don't worry whether it's structured according to all the rules of HTML.
Note - The World Wide Web Consortium, founded by Web inventor Tim Berners-Lee, is the group that developed HTML and maintains the standard version of the language. You can find out more from the consortium Web site at http://www.w3.org. If you want to validate a Web page to see whether it follows all the rules of standard HTML, visit http://validator.w3.org.
There's strong demand on the Internet for software that collects data from Web pages and interacts with services offered over the Internet, such as e-commerce shopping agents that collect price and availability data from online stores, enabling customers to do price comparisons. The developers of services like this quickly run into the inconsistency in how HTML is used to organize Web content. Even if you can write software that puzzles through the markup tags on a page to extract information, any changes to the site's design can stop your program from working correctly.
Designing an XML Dialect
Although XML is described as a language and is compared with HTML, it's actually much larger in scope than that. XML is a markup language that defines how to define a markup language.
That's an odd distinction to make, and it sounds like the kind of thing you'd encounter in a philosophy textbook. This concept is important to understand, though, because it explains how XML can be used to define data as varied as health care claims, genealogical records, newspaper articles, and molecules.
The "X" in XML stands for Extensible, and it refers to organizing data for your own purposes. Data that's organized using the rules of XML can represent anything you want:
A programmer at a telemarketing company can use XML to store data on each outgoing call, saving the time of the call, the number, the operator who made the call, and the result.
A hobbyist can use XML to keep track of the annoying telemarketing calls she receives, noting the time of the call, the company, and the product being peddled.
A programmer at a government agency can use XML to track complaints about telemarketers, saving the name of the marketing firm and the number of complaints.
Each of these examples uses XML to define a new language that suits a specific purpose. Although you could call them XML languages, they're more commonly described as XML dialects or XML document types.
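As a rough sketch of what one of these dialects could look like, the following code builds a record for the hobbyist's call log with the DOM and transformer classes in the JDK. The element and attribute names (CallLog, Call, Company, Product, and time) are invented for this illustration, as are the class name CallLogWriter and the sample values. They are one possible design, not a standard.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class; the dialect it writes is made up for this example.
class CallLogWriter {
    // Builds a one-call log and returns it as XML text.
    static String callLogXml(String time, String company, String product) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

            // Root element of the invented dialect.
            Element log = doc.createElement("CallLog");
            doc.appendChild(log);

            // One Call element per call, with the time as an attribute.
            Element call = doc.createElement("Call");
            call.setAttribute("time", time);
            log.appendChild(call);

            Element companyElement = doc.createElement("Company");
            companyElement.setTextContent(company);
            call.appendChild(companyElement);

            Element productElement = doc.createElement("Product");
            productElement.setTextContent(product);
            call.appendChild(productElement);

            // Serialize the in-memory tree to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build the call log", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callLogXml("2004-03-01T18:30", "Acme Siding", "vinyl siding"));
    }
}
```

The telemarketing company and the government agency would choose different element names for their own data, but the same handful of calls would build and write their documents.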
When a new XML dialect is created, the formal way to define it is to create a document type definition (DTD). This determines the rules that the data must follow to be considered valid in that dialect.
Listing 20.5 contains the DTD for the book database listed earlier.
Listing 20.5 The Full Text of librml.dtd
1: <!ELEMENT Library (Book?)+ >
2: <!ELEMENT Book (Author?, Title, PubDate?,
Publisher?, Subject?, Review?)* >
3: <!ELEMENT Author (#PCDATA)>
4: <!ELEMENT Title (#PCDATA)>
5: <!ELEMENT PubDate (#PCDATA)>
6: <!ATTLIST PubDate edition CDATA "" isbn CDATA "">
7: <!ELEMENT Publisher (#PCDATA)>
8: <!ELEMENT Subject (#PCDATA)>
9: <!ELEMENT Review (#PCDATA)>
In Listing 20.4, the XML file contained the following line:
<!DOCTYPE Library SYSTEM "librml.dtd">
The !DOCTYPE tag is used to identify the DTD that applies to the data. When a DTD is present, many XML tools can read XML created for that DTD and determine whether the data follows all the rules correctly. If it doesn't, it is rejected with a reference to the line that caused the error. This process is called validating the XML.
One thing you'll run into as you work with XML is data that has been structured as XML but wasn't defined using a DTD. This data can be parsed (presuming it's well-formed), so you can read it into a program and do something with it, but you can't check its validity to make sure that it's organized correctly according to the rules of its dialect.
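The difference between parsing and validating can be sketched with the JDK's built-in parser. In this example, the class name DtdValidation is hypothetical, and the rules from Listing 20.5 are embedded in the !DOCTYPE tag as an internal subset rather than loaded from librml.dtd, purely so that the sketch runs without any files. Turning on validation makes the parser report data that is well-formed but breaks the dialect's rules.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration only.
class DtdValidation {
    // The rules of Listing 20.5, written as an internal DTD subset.
    static final String DOCTYPE =
        "<!DOCTYPE Library ["
        + "<!ELEMENT Library (Book?)+ >"
        + "<!ELEMENT Book (Author?, Title, PubDate?, Publisher?, Subject?, Review?)* >"
        + "<!ELEMENT Author (#PCDATA)>"
        + "<!ELEMENT Title (#PCDATA)>"
        + "<!ELEMENT PubDate (#PCDATA)>"
        + "<!ATTLIST PubDate edition CDATA \"\" isbn CDATA \"\">"
        + "<!ELEMENT Publisher (#PCDATA)>"
        + "<!ELEMENT Subject (#PCDATA)>"
        + "<!ELEMENT Review (#PCDATA)>"
        + "]>";

    // Parses the Library markup against the DTD and returns any problems found.
    static List<String> validationErrors(String libraryXml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the data against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity problems arrive here; parsing continues.
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + libraryXml)));
        } catch (SAXException e) {
            // Data that is not even well-formed stops the parse.
            errors.add("fatal: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<Library><Book><Author>Joseph Heller</Author>"
            + "<Title>Catch-22</Title></Book></Library>";
        // Well-formed, but the Book has an Author without the required Title.
        String invalid = "<Library><Book><Author>Joseph Heller</Author>"
            + "</Book></Library>";
        System.out.println(validationErrors(valid));
        System.out.println(validationErrors(invalid));
    }
}
```

The first document should produce an empty list. The second is well-formed, so a non-validating parser would accept it, but a validating parser should report that the Book element does not match the content the DTD allows.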
Tip - To get an idea of what kind of XML dialects have been created, search the XML.org database at http://www.xml.org/xml/registry.jsp. The site includes industry news, developer resources, a conference calendar, frequently asked questions list, and many other subjects.
This article is excerpted from chapter 20 of the book Sams Teach Yourself Java 2 in 21 Days, 4th Edition, written by Rogers Cadenhead and Laura Lemay (Sams; ISBN: 0672326280).