Still having trouble understanding what XML is? In this article Zaid provides a quick primer in which he breaks down an XML document, detailing what each part is as he goes. A must-read for the XML beginner.
A Quick Primer - XML and Parsing XML Documents
The Makeup of an XML Document
An XML document is a tagged data file. The tags in an XML document define the structures and boundaries of the embedded data elements. The syntax of the tags is very similar to that of HTML. Parsing XML simply means retrieving data from an XML document based on its meaning and structure.
Listed below is a sample XML document that contains a mail message:
//mail.xml
<?xml version="1.0"?>
<!DOCTYPE mail SYSTEM "mail.dtd" [
  <!ENTITY from "email@example.com">
  <!ENTITY to "firstname.lastname@example.org">
  <!ENTITY cc "email@example.com">
]>
<mail>
  <From> &from; </From>
  <To> &to; </To>
  <Cc> &cc; </Cc>
  <Date>Fri, 12 Jan 2001 10:21:56 -0600</Date>
  <Subject>XML and Parsing XML Documents </Subject>
  <Body language="english">
    An XML document is a tagged data file. The tags in an XML document define the structures and boundaries of the embedded data elements.
    <Signature>
      Zaid &from; http://www.devarticles.com
    </Signature>
  </Body>
</mail>
In general, there are four main components associated with an XML document: elements, attributes, entities, and DTDs.
An element describes a piece of data. It consists of markup tags and the element's content. The following is an element from the XML file listed above (mail.xml):
<Subject> XML and Parsing XML Documents </Subject>
It contains a start tag, <Subject>, the content "XML and Parsing XML Documents", and an end tag, </Subject>.
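To see the three parts of an element from a program's point of view, here is a minimal sketch using Python's standard xml.etree.ElementTree module. Python is our choice for illustration only; the article itself does not prescribe a language.

```python
import xml.etree.ElementTree as ET

# The Subject element from mail.xml, parsed on its own.
subject = ET.fromstring("<Subject> XML and Parsing XML Documents </Subject>")

print(subject.tag)           # the name used in the start and end tags
print(subject.text.strip())  # the content between the two tags, whitespace trimmed
```

The parser never hands the tags back as text. It reports the element's name and its content separately, which is what "retrieving data based on its structure" amounts to in practice.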
An attribute is used in an element to provide additional information about the element. It resides inside the start tag of the element. In the following example, taken from mail.xml, language is an attribute of the element Body that describes the language used in the message body:
<Body language="english">
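Attributes are reported separately from an element's content. As a small sketch (again assuming Python's standard xml.etree.ElementTree, with a shortened Body element made up for the example):

```python
import xml.etree.ElementTree as ET

# A shortened Body element; only the language attribute matters here.
body = ET.fromstring('<Body language="english">An XML document is a tagged data file.</Body>')

print(body.get("language"))  # value of the language attribute
print(body.attrib)           # all attributes of the element, as a dict
```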
An entity is a named piece of data (either text or binary) that you can reference in an XML document. Entities are categorized as internal or external. An internal entity is defined inside the XML document and doesn't reference any outside content. For example, "from" is an internal entity defined in our XML file above:
<!ENTITY from "email@example.com">
The entity "from" is referenced later in the XML document as &from;. When the document is parsed, the parser replaces each reference with the entity's declared value.
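The replacement of an internal entity can be observed directly. The sketch below uses a cut-down version of mail.xml (internal subset only, with no reference to mail.dtd, so nothing has to be loaded from disk) and Python's standard xml.etree.ElementTree. The cut-down document is an assumption made for the example, not the full file from the article.

```python
import xml.etree.ElementTree as ET

# Cut-down mail.xml: one internal entity, referenced in two places.
MAIL_XML = """<?xml version="1.0"?>
<!DOCTYPE mail [
  <!ENTITY from "email@example.com">
]>
<mail>
  <From>&from;</From>
  <Signature>Zaid &from;</Signature>
</mail>"""

mail = ET.fromstring(MAIL_XML)

# By the time the application sees the text, &from; is already gone.
from_text = mail.find("From").text
signature_text = mail.find("Signature").text
print(from_text)
print(signature_text)
```

The application never sees the &from; reference itself, only the expanded text, so an address declared once can be reused throughout the document.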
An external entity refers to content outside the XML document. Its declaration names that content with a SYSTEM or PUBLIC identifier. A SYSTEM identifier is a URI, typically a filename or a URL, that tells the parser where to find the content. A PUBLIC identifier is a well-known name for the content, and it is followed by a system identifier that the parser can fall back on. The following is an example of an external entity, iconimage, that references a local file called icon.png. The NDATA keyword marks it as an unparsed (binary) entity in the png notation:
<!ENTITY iconimage SYSTEM "icon.png" NDATA png>
A Document Type Definition (DTD) is an optional part of XML that defines the allowable structure for a particular type of XML document. Think of a DTD as the roadmap or rulebook of the XML document. The code listed below shows the DTD for the XML file (mail.xml) listed above:
// mail.dtd
<!ELEMENT mail (From, To, Cc, Date, Subject, Body)>
<!ELEMENT From (#PCDATA)>
<!ELEMENT To (#PCDATA)>
<!ELEMENT Cc (#PCDATA)>
<!ELEMENT Date (#PCDATA)>
<!ELEMENT Subject (#PCDATA)>
<!ELEMENT Signature (#PCDATA)>
<!ELEMENT Body (#PCDATA | Signature)*>
This DTD says that the element called mail contains exactly six sub-elements, in this order: From, To, Cc, Date, Subject, and Body. The term #PCDATA stands for "parsed character data" and indicates that an element can contain only text. The last line of the DTD indicates that the element Body has mixed content: text, Signature sub-elements, or both.
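To make the first rule concrete, the sketch below checks by hand that a mail element has the six required children in the required order. This is not DTD validation: Python's standard library does not include a validating parser, so the helper follows_mail_model is a hypothetical stand-in that covers this one rule only.

```python
import xml.etree.ElementTree as ET

# Child order required by <!ELEMENT mail (From, To, Cc, Date, Subject, Body)>
MAIL_CHILDREN = ["From", "To", "Cc", "Date", "Subject", "Body"]

def follows_mail_model(xml_text):
    """Return True if the root is <mail> with exactly the six children, in order."""
    root = ET.fromstring(xml_text)
    return root.tag == "mail" and [child.tag for child in root] == MAIL_CHILDREN

good = "<mail><From/><To/><Cc/><Date/><Subject/><Body/></mail>"
bad = "<mail><To/><From/><Cc/><Date/><Subject/><Body/></mail>"  # From and To swapped

print(follows_mail_model(good), follows_mail_model(bad))
```

A validating parser performs this kind of check for every declaration in the DTD, which is why a DTD is useful when documents are exchanged between programs written by different people.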
Event-Based XML Parser Versus Tree-Based XML Parser
There are two types of interfaces available for parsing XML documents: the event-based interface and the tree-based interface.
Event-Based XML Parsers
An event-based XML parser reports parsing events directly to the application through callback methods. It provides a serial-access mechanism for accessing XML documents. Applications that use a parser's event-based interface need to implement the interface's event handlers to receive parsing events.
The Simple API for XML (SAX) is an industry-standard event-based interface for XML parsing. The SAX 1.0 Java API defines several callback methods in one of its interfaces. An application implements these callback methods to receive parsing events from the parser. For example, startElement() is one of these callback methods. When a SAX parser reaches the start tag of an element, it calls the application's startElement() implementation and passes the tag name as one of the method's parameters.
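The callback model is easiest to understand by watching it run. The sketch below uses Python's standard xml.sax module, which follows the same event-based design; the method signatures differ in detail from the SAX 1.0 Java API discussed above. The TagCollector class and the sample document are made up for the example.

```python
import xml.sax

class TagCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records each start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # Called by the parser each time it reaches a start tag.
        self.tags.append(name)

handler = TagCollector()
xml.sax.parseString(
    b"<mail><From>a</From><Subject>XML</Subject><Body language='english'>text</Body></mail>",
    handler,
)
print(handler.tags)
```

Note that the handler only sees each tag once, in document order, and the parser keeps nothing afterwards. Anything the application wants to remember, such as the list of tags here, it has to store itself.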
Tree-Based XML Parsers
A tree-based XML parser reads an entire XML document into an internal tree structure in memory. Each node of the tree represents a piece of data from the original document. This method allows an application to navigate and manipulate the parsed data quickly and easily.
The Document Object Model (DOM) is an industry-standard tree-based interface for XML parsing. A DOM parser can be very memory and CPU intensive because it keeps the whole data structure in memory. It may cause performance problems in your wireless applications, especially when the XML document to be parsed is large and complex.
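For comparison, here is a minimal sketch of the tree-based approach, using Python's standard xml.dom.minidom module as one DOM implementation. The sample document is made up for the example.

```python
from xml.dom import minidom

# The whole document is read into an in-memory tree up front.
doc = minidom.parseString(
    "<mail><Subject>Old subject</Subject><Body language='english'>text</Body></mail>"
)

# Navigate: look up nodes anywhere in the tree, in any order.
subject = doc.getElementsByTagName("Subject")[0]
body = doc.getElementsByTagName("Body")[0]

# Manipulate: change a text node and an attribute in place.
subject.firstChild.data = "New subject"
body.setAttribute("language", "french")

# Serialize the modified tree back to XML text.
result = doc.documentElement.toxml()
print(result)
```

Because the full tree stays in memory, the application can jump between nodes and edit them freely, at the cost of holding the entire document at once.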
In general, SAX parsers are faster and consume less CPU and memory than DOM parsers. However, SAX parsers allow only serial access to the XML data. A DOM parser's tree-structured data, by contrast, is easier to access and manipulate. SAX parsers are often used by Java servlets or network-oriented programs to transmit and receive XML documents quickly and efficiently. DOM parsers are often used for manipulating XML documents that are stored on disk, such as a configuration file or a previously saved order.