Tanzarine Technology

 

Introduction to XML

This document attempts to give a brief, whistle-stop tour of XML, what it is, what you can do with it, and what tools are built around it. It is loosely based on a document originally produced in response to a client request.


Introduction

What does it stand for?

What does it mean?

Is it like HTML then?

Why use it for data transfer?

Why do I need to transform it?

How do I tell others what tags are valid in my documents?

How do XML newsfeeds work?

Can my current applications offer XML interfaces?

Further reading

Introduction

This document is about XML. It attempts to do something very foolish: to give a brief, whistle-stop tour of XML, what it is, what you can do with it, and what tools are built around it. This is foolish because it is almost impossible to do so both completely and briefly.

Therefore, at various stages of this document, phrases like "beyond the scope" and "actually, that's not quite right, as we'll see later" will appear. Remember that the aim is to provide an overview. If you want the gory details, you'll have to go off and find them elsewhere, although there will be plenty of suggestions as to where to look.


What does it stand for?

XML stands for eXtensible Markup Language. This means pretty much what it says. It's a markup language, which means that it's a way to present data using tags to indicate meaning or usage. It's also extensible, which means that at a fundamental level nobody tells you what tags to use; in fact, you tell them.


What does it mean?

Here is an example XML document. It misses out some of the fiddly control elements for telling XML parsers what's going on, but we don't need to worry about that here:

<dbfetch>
  <code>760001234</code>
</dbfetch>
And that's basically it. Whilst there are lots of other things we can do, the basics are just data surrounded by tags.
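
As a rough illustration of "data surrounded by tags", the document above can be read with a few lines of Python. This is only a minimal sketch: the choice of Python, the standard library's xml.etree.ElementTree module and the variable names are our own assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# The example document from above, held in a string.
document = """
<dbfetch>
  <code>760001234</code>
</dbfetch>
"""

# Parse the text into a tree and look at the outermost tag.
root = ET.fromstring(document)
print(root.tag)

# Fetch the data stored inside the <code> tag.
code = root.findtext("code")
print(code)
```

The program needs to know only the tag names, not where in the text the data happens to sit.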


Is it like HTML then?

HTML stands for HyperText Markup Language. It has been used to create web pages since the early 1990s. Both HTML and XML are derived from a much older and more complex markup language called SGML (Standard Generalized Markup Language), but we don't need to worry about that here.

Although XML is similar to HTML, there are some subtle differences. HTML is mostly interpreted by web browsers, which are free to parse the document in pretty much any way they feel like, so HTML isn't very strict about document structure. XML, however, is strict, because the document may have to be processed by any number of (initially ignorant) applications. These are the main differences:


Attributes must be enclosed in quotes

In HTML, you can get away with this:

<table width=100 border=0>
In XML, you cannot, and must say this:
<table width="100" border="0">


All elements (tags) must be closed

In HTML, you will frequently see this:

<br>
<img src="images/logo.gif">
These tags are not closed, in that there is no </br> or </img> following the opening tags. In XML, every opening tag must have a matching closing tag. Fortunately, a shortcut is allowed for tags with no content, like this:
<br/>
<img src="images/logo.gif"/>
Note the "/" at the end of the tag.


Case matters

In HTML, the following are equivalent:

<html>
<HTML>
In XML, these are regarded as being different. Note that there are various guidelines in place, as with Java programming, as to what case should be used when.
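
The three rules above can be checked by handing small fragments to an XML parser and seeing which it accepts. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample fragments are our own, chosen to mirror the examples in this section.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup that breaks each of the three rules in turn.
unquoted = '<table width=100 border=0></table>'
unclosed = '<p><br></p>'
mixed_case = '<html></HTML>'

# The XML-friendly equivalents.
quoted = '<table width="100" border="0"></table>'
closed = '<p><br/></p>'
same_case = '<html></html>'

for sample in (unquoted, unclosed, mixed_case, quoted, closed, same_case):
    print(sample, "->", is_well_formed(sample))
```

The first three fragments are rejected and the last three accepted, whereas a typical browser would tolerate all six as HTML.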


Why use it for data transfer?

To explain why XML is so good for this, we'll look at a couple of examples of getting data from A to B, and then see how XML can make life easier.

Here's the data we want to transfer:

Sales Breakdown

Widget Name    Part Number    % of all Sales
Spanner        1              25
Hammer         2              25
Wrench         3              10
Screwdriver    4              40

What do we know about the data? Well, we can see from the title that it's a sales breakdown, and from the table we can see that we make a number of widgets, and have recorded what percentage of our total sales each represents.


Using HTML

Let's transfer this data from our server to a web browser using HTML:

<table>
  <tr>
    <th colspan="3">
      Sales Breakdown
    </th>
  </tr>
  <tr>
    <th>
      Widget Name
    </th>
    <th>
      Part Number
    </th>
    <th>
      % of all Sales
    </th>
  </tr>
  <tr>
    <td>
      Spanner
    </td>
    <td>
      1
    </td>
    <td>
      25
    </td>
  </tr>
  <tr>
    <td>
      Hammer
    </td>
    <td>
      2
    </td>
    <td>
      25
    </td>
  </tr>
  <tr>
    <td>
      Wrench
    </td>
    <td>
      3
    </td>
    <td>
      10
    </td>
  </tr>
  <tr>
    <td>
      Screwdriver
    </td>
    <td>
      4
    </td>
    <td>
      40
    </td>
  </tr>
</table>

Whilst this is fairly straightforward, and will produce the table in the browser window as we would expect, what if we wanted to replace the browser client with one which would read the sales figures and use them, say, to produce a pie chart? Our program would need to know where to look in the document to fetch all the pieces, and it would need to know some details about the document to make sure it really was a sales breakdown (or at least a set of figures which could be used for a pie chart) before it started.

We could confuse the issue by changing the title presentation so that it is outside the table, e.g.

<h1>
  Sales Breakdown
</h1>
If our program was still looking for a <th colspan="3"> to contain the title, it would be rather disappointed. The same argument applies to all the other data items.


Using CSV

Let's now explore what happens when we transfer this data between two applications, A and B, using a common transfer format, CSV.

Sales Breakdown
Spanner,1,25
Hammer,2,25
Wrench,3,10
Screwdriver,4,40
This format is not burdened with all the display particulars of the HTML equivalent, and it will be much easier for the programmer working on application B to extract the data and use it. In this case, the fields are fairly intuitive, but this might not always be the case. Indeed, what happens if the programmer on application A changes it so that it now produces:
Sales Breakdown
1,25,Spanner
2,25,Hammer
3,10,Wrench
4,40,Screwdriver
Application B will now fail, until the programmer realises, or the programmer on application A phones him to tell him that he's changed things around.

More formally, such transfer mechanisms are dependent on an independently documented data format, which must be agreed by all parties.
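
To make the problem concrete, here is a minimal sketch of how application B might read the CSV data. It is written in Python purely for illustration; the function name read_percentages and the shortened sample data are our own assumptions. The reader relies on the column order, so it breaks as soon as application A rearranges the fields.

```python
import csv
import io

def read_percentages(text):
    """Read 'name,part,percentage' rows, skipping the title line."""
    rows = list(csv.reader(io.StringIO(text)))[1:]
    # The meaning of each column is assumed here; the file does not state it.
    return {name: int(percentage) for name, part, percentage in rows}

original = "Sales Breakdown\nSpanner,1,25\nHammer,2,25\n"
reordered = "Sales Breakdown\n1,25,Spanner\n2,25,Hammer\n"

print(read_percentages(original))

try:
    read_percentages(reordered)
except ValueError as error:
    # int("Spanner") fails: the reader still expects the old column order.
    print("Application B failed:", error)
```

In this case the failure is at least noisy. Had two numeric columns been swapped instead, application B would have carried on quietly with the wrong figures.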


Using XML

Let's now look at an equivalent transfer scenario using XML. Once again, we'll remove some of the control elements to concentrate on the data itself.

<SalesBreakdown>
  <widget>
    <name>
      Spanner
    </name>
    <part>
      1
    </part>
    <percentage>
      25
    </percentage>
  </widget>
  <widget>
    <name>
      Hammer
    </name>
    <part>
      2
    </part>
    <percentage>
      25
    </percentage>
  </widget>
  <widget>
    <name>
      Wrench
    </name>
    <part>
      3
    </part>
    <percentage>
      10
    </percentage>
  </widget>
  <widget>
    <name>
      Screwdriver
    </name>
    <part>
      4
    </part>
    <percentage>
      40
    </percentage>
  </widget>
</SalesBreakdown>
Straight away we can see that this is much more verbose than the previous examples. However, we can still inspect the document to see how it is structured, but more importantly, because the tags all form a tree structure, a basic program can be used to extract the data without knowing anything about the data itself.

The programmer on application B can use XML libraries for Java such as JDOM in order to write processing programs for the data, or if display to a browser is required, an XML transformation can be used for producing the necessary HTML.
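
As a sketch of what such a processing program might look like, the following uses Python's standard xml.etree.ElementTree module rather than JDOM (the language, the function name read_percentages and the shortened sample documents are our own assumptions). Because the data is looked up by tag name, the order of the tags inside each widget does not matter.

```python
import xml.etree.ElementTree as ET

def read_percentages(text):
    """Map each widget name to its percentage, looking tags up by name."""
    root = ET.fromstring(text)
    result = {}
    for widget in root.findall("widget"):
        # strip() removes any whitespace around the value inside the tag.
        name = widget.findtext("name").strip()
        result[name] = int(widget.findtext("percentage"))
    return result

original = """
<SalesBreakdown>
  <widget><name>Spanner</name><part>1</part><percentage>25</percentage></widget>
  <widget><name>Hammer</name><part>2</part><percentage>25</percentage></widget>
</SalesBreakdown>
"""

# The same data with the child tags in a different order.
reordered = """
<SalesBreakdown>
  <widget><part>1</part><percentage>25</percentage><name>Spanner</name></widget>
  <widget><part>2</part><percentage>25</percentage><name>Hammer</name></widget>
</SalesBreakdown>
"""

print(read_percentages(original))
print(read_percentages(reordered))
```

Both calls give the same result, which is the practical benefit of the extra verbosity.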


Why do I need to transform it?

As mentioned above, for our notional application B, we might want to take the data received and then display it in a browser. We start with the XML document from "Using XML", but would like to end up with the HTML from "Using HTML", and hence we now have a requirement to transform our received document into a new format.

Fortunately, this problem is addressed by XSLT, which stands for XSL Transformations. XSL in turn stands for eXtensible Stylesheet Language. Much as CSS tells a browser how to display the HTML tags in a document, XSLT tells a "transformer engine" what to do with the XML tags in a document.

Here is some XSLT which will transform our XML data into the HTML table.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:template match="SalesBreakdown">
<table>
  <tr>
    <th colspan="3">
      Sales Breakdown
    </th>
  </tr>
  <tr>
    <th>
      Widget Name
    </th>
    <th>
      Part Number
    </th>
    <th>
      % of all Sales
    </th>
  </tr>
  <xsl:apply-templates />
</table>
</xsl:template>

<xsl:template match="widget">
<tr>
  <xsl:apply-templates />
</tr>
</xsl:template>

<xsl:template match="name">
<td>
  <xsl:value-of select="." />
</td>
</xsl:template>

<xsl:template match="part">
<td>
  <xsl:value-of select="." />
</td>
</xsl:template>

<xsl:template match="percentage">
<td>
  <xsl:value-of select="." />
</td>
</xsl:template>

</xsl:stylesheet>
The first thing to notice is that this is itself an XML document. However, there are two types of tags: those which define the transformation operations (these all start with xsl:), and HTML tags for use in our final document (these are normal table tags). Although there are HTML tags in this document, it must be well-formed, because it is just another XML document in the eyes of a parser. That's why we have to make sure that the colspan attribute has quotes around its value.

Unlike previous documents, we've left in a little bit of the header detail, because it's important. We've said that the transformation operations all start with xsl:, but why is this? In fact, the prefix itself could be anything; using xsl: is simply the convention for XSLT. Given that it could be anything, how does the transformer tell which tags are instructions? This is determined by the xmlns:xsl attribute in the first tag. This defines a "namespace", and the value of the attribute identifies the namespace we wish to associate with the xsl: prefix, in this case XSLT.
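
The point that the prefix is arbitrary can be seen by parsing two otherwise identical tags that use different prefixes. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the alternative prefix t: is made up for the purpose of the illustration.

```python
import xml.etree.ElementTree as ET

XSLT_NAMESPACE = "http://www.w3.org/1999/XSL/Transform"

# Two tags that differ only in the prefix bound to the XSLT namespace.
with_xsl = '<xsl:stylesheet xmlns:xsl="%s" version="1.0"/>' % XSLT_NAMESPACE
with_t = '<t:stylesheet xmlns:t="%s" version="1.0"/>' % XSLT_NAMESPACE

# ElementTree replaces each prefix with the namespace it stands for.
tag_a = ET.fromstring(with_xsl).tag
tag_b = ET.fromstring(with_t).tag

print(tag_a)
print(tag_a == tag_b)
```

Both tags come out with the same expanded name, so a namespace-aware program treats them identically.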

So, the transformer engine will take its instructions from the tags starting with xsl:. Everything else in the stylesheet is of no interest to it, so those tags and text are simply copied to the transformed document.

How does XSLT work? Well, in this example, we've used a number of discrete templates, the first of which matches "SalesBreakdown", so the transformer will, on encountering a <SalesBreakdown> tag in the source document, output everything in the template. However, this also includes an "apply-templates" directive, which tells the transformer to carry on through the source document looking for other tags which match other templates within the XSLT.

The next tag in the source document is a <widget> tag, and there is a template for that, which gets processed. Notice that this causes a new row to be created in the table, but then there's another "apply-templates" directive, so the transformer continues through the source document.

The other tags are handled in a similar fashion by dedicated templates. The only extra point to notice is that when we reach the tags with data stored in them, there is no "apply-templates" directive, because there's nothing further to process. However, we would like to output the data into the final document, and that's what "value-of" does. The "select" attribute tells the transformer what value to output, and in this case "." means the data stored in the current tag.

There are a couple of different ways we could define this transformation, but we won't look into streamlining it here, because we're more concerned with the general principle of transforming.
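
The template-matching walk described above can be imitated in a few lines of ordinary code. The sketch below is not an XSLT engine and ignores XSLT's built-in rules; it is a Python analogy in which a hypothetical TEMPLATES table stands in for the match attributes, and apply_templates plays the part of the "apply-templates" directive. All the function names are our own.

```python
import xml.etree.ElementTree as ET

def apply_templates(element):
    """Run the matching template for each child tag, in document order."""
    return "".join(TEMPLATES[child.tag](child)
                   for child in element if child.tag in TEMPLATES)

def sales_breakdown(element):
    # Fixed header rows first, then whatever the child templates produce.
    title = '<tr><th colspan="3">Sales Breakdown</th></tr>'
    columns = ("<tr><th>Widget Name</th><th>Part Number</th>"
               "<th>% of all Sales</th></tr>")
    return "<table>" + title + columns + apply_templates(element) + "</table>"

def widget(element):
    # One table row per widget, then carry on into its children.
    return "<tr>" + apply_templates(element) + "</tr>"

def cell(element):
    # Similar in spirit to <xsl:value-of select="." />.
    return "<td>" + element.text.strip() + "</td>"

# Hypothetical lookup table standing in for the match="..." attributes.
TEMPLATES = {
    "SalesBreakdown": sales_breakdown,
    "widget": widget,
    "name": cell,
    "part": cell,
    "percentage": cell,
}

source = """
<SalesBreakdown>
  <widget><name>Spanner</name><part>1</part><percentage>25</percentage></widget>
</SalesBreakdown>
"""

root = ET.fromstring(source)
html = TEMPLATES[root.tag](root)
print(html)
```

A real transformer does far more than this, but the flow of control is the same: match a tag, emit some output, then descend into the children.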


How do I tell others what tags are valid in my documents?

In the previous section we looked at the case where our imaginary application receives an XML document, and transforms it into HTML using XSLT. What happens if the application receives an XML document such as the following:

<UselessDocument>
  <title>
    Well, this is a waste
  </title>
</UselessDocument>
This document does not contain any of the tags expected by our application, so applying the transform will not produce the table we want. It would be much better if we could check the incoming document and make sure it has the "right stuff" before we start.

Fortunately we can use XML Schemas to define what constitutes a valid document, and use that schema both to tell others what they should put into the document before they start, and to check incoming documents to ensure they are correct before we start processing.

Like XSLT stylesheets, schemas are themselves XML documents. They define what tags are allowed in the document, what attributes they can have, which items are mandatory and which optional, and many other things besides.

Here is a sample schema for our application's XML document. Also like the XSLT example, this isn't the only way we can implement a schema, but it gives an indication of the kind of things we can do.

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="SalesBreakdown">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="widget" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:sequence>
              <xsd:element name="name" type="xsd:string"/>
              <xsd:element name="part" type="xsd:positiveInteger"/>
              <xsd:element name="percentage">
                <xsd:simpleType>
                  <xsd:restriction base="xsd:integer">
                    <xsd:minInclusive value="0"/>
                    <xsd:maxInclusive value="100"/>
                  </xsd:restriction>
                </xsd:simpleType>
              </xsd:element>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
The last time we looked at an XML document for helping us with our own XML document, we noted the presence of xsl: in the tags. Here you'll see xsd:, which is defined at the top of the document in a similar way, and tells us that we are dealing with XML Schema.

Note from the schema that we can define not only the tag structure of our document, but also what data types can appear within the tags. In fact, the schema language allows us to do much more than this: we can also specify precisely how many times a tag can appear in a certain place and, for any data, the precise values which are allowed.

This is how we can define the percentage tag above and restrict the range of values allowed.

However, one thing we cannot do is stipulate that the sum of all the values stored in the percentage tags in the document must add up to 100.
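
A rule like this has to be checked by the application itself once the document has passed schema validation. The following is a minimal sketch of such a check in Python, using the standard xml.etree.ElementTree module; the function name percentages_total and the sample figures are our own assumptions.

```python
import xml.etree.ElementTree as ET

def percentages_total(text):
    """Add up the values of every <percentage> tag in the document."""
    root = ET.fromstring(text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return sum(int(tag.text) for tag in root.iter("percentage"))

good = """
<SalesBreakdown>
  <widget><name>Spanner</name><part>1</part><percentage>60</percentage></widget>
  <widget><name>Hammer</name><part>2</part><percentage>40</percentage></widget>
</SalesBreakdown>
"""

# Each value is still between 0 and 100, so a range check alone would pass.
bad = good.replace("<percentage>40", "<percentage>45")

for document in (good, bad):
    total = percentages_total(document)
    print(total, "ok" if total == 100 else "does not add up to 100")
```

The second document satisfies every individual range restriction and yet is clearly wrong, which is why this kind of cross-field rule needs its own check.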


How do XML newsfeeds work?

These are a good example of XML in action on the web. An increasing number of websites now offer a precis of their news articles and other content in an XML format. These can be viewed directly in a browser which is XML-capable, or can be downloaded by portal applications and processed for further consumption by other applications.

XML is important here because of the range of possible destinations: browsers, applications, or the new "wireless" devices. Each has different requirements, but the source data can be the same.

One popular format is called "RSS", which initially stood for "RDF Site Summary" ("RDF" in turn stands for "Resource Description Framework") and now just for "Rich Site Summary". It was initially created by Netscape for use on their MyNetscape portal.

This format allows information about news items and other published content to be summarised in XML, and here is an example taken from the xml.com web site:

<!DOCTYPE rss PUBLIC
  "-//Netscape Communications//DTD RSS 0.91//EN"
  "http://my.netscape.com/publish/formats/rss-0.91.dtd"
>

<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <description>
      XML.com features a rich mix of information and services for the XML
community.
    </description>
    <language>en-us</language>
    <link>http://www.xml.com/</link>
    <copyright>Copyright 2000, O'Reilly and Associates</copyright>
    <managingEditor>edd@xml.com (Edd Dumbill)</managingEditor>
    <webMaster>peter@xml.com (Peter Wiggin)</webMaster>
    <image>
      <link>http://www.xml.com/</link>
      <url>http://www.xml.com/universal/images/xml_tiny.gif</url>
      <title>XML.com</title>
    </image>

    <item>
      <title>Around and About at XML Europe 2001</title>
      <link>http://www.xml.com/pub/a/2001/05/23/about.html</link>
      <description>Pictures and notes from the GCA's XML Europe 2001
conference.</description>
    </item>

    <item>
      <title>Using the Jena API to Process RDF</title>
      <link>http://www.xml.com/pub/a/2001/05/23/jena.html</link>
      <description>Jena is a freely-available Java API for processing RDF.
This article provides an introduction to the API and its
implementation.</description>
    </item>

  </channel>
</rss>
Note that the document contains one RSS channel, within which there is a variety of information about the channel itself, such as who looks after it, followed by a number of content items (here we've reduced the list to make it more readable).
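
A portal application consuming such a feed might start with something like the following. This is a minimal sketch in Python using the standard xml.etree.ElementTree module; the feed is a cut-down copy of the example above held in a string rather than downloaded, and the variable names are our own.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the feed above (DOCTYPE and most channel details omitted).
feed = """
<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <item>
      <title>Around and About at XML Europe 2001</title>
      <link>http://www.xml.com/pub/a/2001/05/23/about.html</link>
    </item>
    <item>
      <title>Using the Jena API to Process RDF</title>
      <link>http://www.xml.com/pub/a/2001/05/23/jena.html</link>
    </item>
  </channel>
</rss>
"""

channel = ET.fromstring(feed).find("channel")
print("Channel:", channel.findtext("title"))

# Collect a (title, link) pair for every content item in the channel.
headlines = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
for title, link in headlines:
    print("-", title, link)
```

The same list of headlines could then be turned into HTML for a browser, or into whatever format another device requires.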

It's also important to note the DOCTYPE declaration in the document, because this leads us on to DTDs (Document Type Definitions), which we haven't covered yet. These are an earlier way of defining which tags are valid in a document, and have been superseded in many respects by XML Schema. However, many documents still use DTDs, as in some cases they can be simpler to maintain.

A validating XML parser will use the link in the DOCTYPE declaration to retrieve the DTD, read it, and then use the rules contained within it to validate the RSS document.


Can my current applications offer XML interfaces?

Any application which operates in a web environment should be capable of offering XML interfaces to its content and services. Certainly any application which produces web pages should be able to produce XML output. The preferred scenario is for it to produce XML as its primary output format, and then use XSLT to transform the XML into the HTML expected by the browser. Other applications are then free to interrogate the XML data directly, and process it as they require.

Applications which do not currently operate in a web environment would need to be migrated to some environment which offered easy exchange of the XML data with other applications. Although this could be done via data files and FTP, or direct socket I/O, use of a web environment, where XML is transmitted over HTTP, offers the most flexibility and compatibility with other applications.
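
Producing XML as the primary output format need not be difficult. As a minimal sketch, an application could build the sales breakdown document from its own data like this; the example is in Python using the standard xml.etree.ElementTree module, and the function name sales_to_xml and the rows are hypothetical.

```python
import xml.etree.ElementTree as ET

def sales_to_xml(rows):
    """Build a <SalesBreakdown> document from (name, part, percentage) rows."""
    root = ET.Element("SalesBreakdown")
    for name, part, percentage in rows:
        widget = ET.SubElement(root, "widget")
        ET.SubElement(widget, "name").text = name
        ET.SubElement(widget, "part").text = str(part)
        ET.SubElement(widget, "percentage").text = str(percentage)
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")

# Hypothetical rows, as they might come back from the application's database.
rows = [("Spanner", 1, 25), ("Hammer", 2, 25)]
xml_text = sales_to_xml(rows)
print(xml_text)
```

The resulting text can be served over HTTP as it stands, or passed through a transformation first when HTML is wanted.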


Further reading

There are many sources of information on XML and associated technologies. There are many books on these subjects; however, the field changes so rapidly that online resources often offer a better insight.

XML.com contains a wealth of articles, product reviews, tools and links. In particular, there is a good introduction to RSS.

All of the major manufacturers have web sites dedicated to XML and its usage within their products and environments. Sun, Sun's Java home page, Microsoft and IBM are all good places to start.

Other sites which may be of interest are xml.org, which is hosted by OASIS, and xmlhack.

And perhaps most importantly, the W3C maintains the list of XML specifications, which can be hard going, as well as a variety of other useful XML resources.

Copyright ©2008 Tanzarine Technology Ltd