Introduction to XML
Posted 2007-02-08 23:50. Tagged markup, xml.
Here’s a short introduction to XML, as I wasn’t able to find one I liked with a quick Google search. Feedback is welcome!
My goal here is to cover a more or less complete understanding of XML, but no particular application or use.
The basic concept of xml is to have plain text information, and put markup around parts of it. The parts might be chapters, headlines, and paragraphs, or they might be product descriptions, order numbers, and prices. Or really anything meaningful.
It is often a good idea to keep the markup semantic, specifying what a part of the text is rather than how it is supposed to look (the looks of things can often be specified in a separate stylesheet).
Elements and tags
An element in xml is a start-tag, some content, and an end-tag. The content can contain text and/or other elements. Elements cannot overlap partially.
A tag is wrapped in the characters “<” and “>”, and the end tag is marked by a slash followed by the tag name. A foo element containing the word bar looks like this:
If an element has no content, the start-tag and the end-tag can be combined into an empty-element tag, where the slash marking the end of the element is put at the end of that tag. An empty foo element can look like this, note the trailing slash:
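To see the two forms in action, here is a minimal sketch using Python's standard xml.etree.ElementTree module (the choice of Python and of this module is mine, nothing in XML requires it). It parses a foo element containing bar, and an empty foo element in both spellings.

```python
import xml.etree.ElementTree as ET

# A foo element containing the word bar: start-tag, content, end-tag.
foo = ET.fromstring("<foo>bar</foo>")
print(foo.tag, foo.text)

# An empty foo element, written as a single empty-element tag.
empty = ET.fromstring("<foo/>")
print(empty.tag, empty.text)   # the text is None when there is no content

# The combined tag means the same as a start-tag directly followed by an end-tag.
same = ET.fromstring("<foo></foo>")
print(ET.tostring(empty) == ET.tostring(same))
```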
Some elements have attributes that specify extra metadata for the content. An abbreviation might for example have its meaning specified:
An attribute is always specified as a name followed by an equals sign
and a quoted value (the plain ASCII double quote #34 (") or single
quote #39 ('), no
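As a small sketch of the attribute syntax, the Python snippet below parses an abbreviation with its meaning given in an attribute. The element name abbr and the attribute name title are assumptions of mine, picked only for illustration. It also shows that both quote styles give the same result, and that an unquoted value is rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical abbr element with a hypothetical title attribute.
double = ET.fromstring('<abbr title="Extensible Markup Language">XML</abbr>')
single = ET.fromstring("<abbr title='Extensible Markup Language'>XML</abbr>")

print(double.get("title"))
print(double.attrib == single.attrib)   # the quote style makes no difference

# A value without quotes is not well-formed, so the parser refuses it.
try:
    ET.fromstring("<abbr title=unquoted>XML</abbr>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)
```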
The document tree
An xml document must consist of exactly one root element and its contents. Since elements can’t overlap partially, each element (except the root node) has exactly one parent element. Hence, the entire document can be seen as a tree.
Attributes and text nodes are considered separate nodes, and the attributes of an element always appear before the contents of the element. Thus the following simple document:
<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>
… has the following document tree:
- element node text
- attribute node lang = en
- text node Hello, World! I am
- element node name
- text node
- text node
- text node
The document tree is often a good way to think of a document, and should map directly to document semantics, even if some vocabularies (notably HTML) don’t enforce such a mapping.
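The tree view can be sketched with Python's xml.dom.minidom, which exposes text as separate nodes. The document string below is my reading of the small example in this section (a text element with a lang attribute, containing a name element), so treat the exact markup as an assumption. The describe function is just a helper I made up for printing.

```python
from xml.dom import minidom

# Assumed markup for the small example document in this section.
doc = minidom.parseString(
    '<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>'
)

def describe(node, depth=0):
    """Return one line per node, listing attributes before contents."""
    pad = "  " * depth
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append(f"{pad}- element node {node.tagName}")
        for name, value in node.attributes.items():
            lines.append(f"{pad}  - attribute node {name} = {value}")
        for child in node.childNodes:
            lines.extend(describe(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE:
        lines.append(f"{pad}- text node {node.data!r}")
    return lines

tree = describe(doc.documentElement)
print("\n".join(tree))
```

Every node except the root has exactly one parent here, which is what makes the indentation possible at all.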
There are two modes for how whitespace is handled: default and preserve.
When preserving whitespace, each whitespace character (plain space, tab, new line, carriage return) is handled as just that character.
In default mode, each sequence of consecutive whitespace, including space, horizontal tab, newline and carriage return, is equivalent to a single whitespace.
There is a special attribute,
xml:space, that may be used to specify
the white space mode for the content of an element.
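Here is a rough Python sketch of the two modes. The parser itself hands over all whitespace untouched, so the folding is something the application does. The poem element and the content helper are hypothetical, and for brevity the helper only looks at xml:space on the element itself (it ignores inheritance from ancestors, and it also trims leading and trailing whitespace in default mode).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def content(element):
    """Return the element text, honouring xml:space on the element itself."""
    text = element.text or ""
    if element.get(XML_SPACE) == "preserve":
        return text                   # every whitespace character is kept
    return " ".join(text.split())     # each run of whitespace acts as one space

kept = ET.fromstring('<poem xml:space="preserve">roses \n  are\tred</poem>')
folded = ET.fromstring("<poem>roses \n  are\tred</poem>")

print(repr(content(kept)))
print(repr(content(folded)))
```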
Characters and Entities
An xml file can be written using any character encoding, as long as it tells the reader which one. That is done on the first line of the file, which looks like this:
Both the version and the encoding have default values (for version,
1.1 is specified and 1.0 is the default; for encoding,
utf-8 is the default, but any encoding can be used) so the above is
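A quick way to convince yourself that the declared encoding matters is to parse the same text stored in two encodings. This is a sketch in Python; the greeting element and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

text = "<greeting>Hej, v\u00e4rlden!</greeting>"

# The same document in two encodings; only the first has to name its encoding.
latin1 = b'<?xml version="1.0" encoding="iso-8859-1"?>' + text.encode("iso-8859-1")
utf8 = text.encode("utf-8")   # no declaration, so utf-8 is assumed

from_latin1 = ET.fromstring(latin1).text
from_utf8 = ET.fromstring(utf8).text
print(from_latin1 == from_utf8, from_utf8)
```

The bytes on disk differ, but the characters the reader ends up with are the same.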
Some characters have special meaning in xml. What if I want to express
the mathematical relation x < 2? If I write it just like
that, the < will mark the beginning of a tag, and not the less-than
character. Luckily, there is something called a character
entity for this. Using that, I can write x &lt; 2
and it will mean x < 2 (there is also a specific vocabulary for
mathematical statements, called MathML, but that is way beyond
the scope of this introduction).
The character “&” always marks the start of an entity reference
(except in a comment). It is followed by a name and
a semicolon. XML defines a few character entities by default.
There are also numerical character entities, which refer to Unicode
characters (regardless of the document encoding). They can be written as
&#NNN; in decimal or as
&#xNNN; in hexadecimal. For example,
&#65; is the letter A and
&#x3b1; is the Greek letter α.
It is also possible to declare arbitrary named entities, but that requires using a DTD (see below) and isn’t used much today, except in legacy html documents.
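The following Python sketch shows a parser resolving a named entity and two numerical ones, and the reverse step of escaping text before putting it into a document. The m element is a made-up name; escape comes from the standard xml.sax.saxutils module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

named = ET.fromstring("<m>x &lt; 2</m>").text          # named entity
numeric = ET.fromstring("<m>&#65; &#x3b1;</m>").text   # decimal and hexadecimal
escaped = escape("x < 2 & y > 3")                      # text made safe for output

# Writing the less-than sign as is gives a document that is not well-formed.
try:
    ET.fromstring("<m>x < 2</m>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False

print(named, numeric, escaped, raw_ok, sep=" | ")
```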
A comment in xml starts with
<!-- and ends with
-->. It can
contain anything except two sequential dashes, including newlines and
commented out markup. Entities are not recognized within comments.
<!-- This is a comment (which may contain < & >) -->
Comments can’t be nested, since a comment must not contain the string
--. A common workaround for that is to put a space between the
dashes in the inner comment:
<!-- a comment <!- - with a "nested" comment - -> end of comment: -->
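To check these rules against a real parser, here is a small sketch with Python's xml.dom.minidom (which, unlike some tree APIs, keeps comments around as nodes). The doc wrapper element and the parses helper are only there for the demonstration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    "<doc><!-- This is a comment (which may contain < & >) --></doc>"
)
comment = doc.documentElement.firstChild
print(comment.nodeType == comment.COMMENT_NODE, repr(comment.data))

def parses(source):
    """Return True if the source is accepted by the parser."""
    try:
        minidom.parseString(source)
        return True
    except ExpatError:
        return False

# A truly nested comment contains two sequential dashes and is rejected,
# while the spaced-out workaround goes through.
nested = parses("<doc><!-- outer <!-- inner --> outer --></doc>")
spaced = parses("<doc><!-- outer <!- - inner - -> outer --></doc>")
print(nested, spaced)
```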
A processing instruction is a special tag that tells a program reading the file what to do with it. It might for example tell it to use a specific xslt transform when presenting the document to a user.
A processing instruction, just like a tag, starts with a name. But
after that the syntax is freer, allowing anything but the string
?> (which terminates the processing instruction). Detailed syntax of
processing instructions can be specific to the toolchain that
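As an illustration, the sketch below reads an xml-stylesheet processing instruction with Python's xml.dom.minidom. The file name style.xsl is hypothetical. Note that the parser only splits the instruction into a name (the target) and the rest (the data); making sense of the data is left to whatever tool the instruction is aimed at.

```python
from xml.dom import minidom

# Hypothetical stylesheet reference in front of an empty doc element.
source = '<?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'
pi = minidom.parseString(source).firstChild

print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
print(pi.target)   # the name the instruction starts with
print(pi.data)     # everything after the name, up to the closing ?>
```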
Doctype and cdata
These are SGML legacy. A
DOCTYPE was used to tell the
parser which SGML vocabulary to expect, by linking to a DTD. A
CDATA block was used for a block of verbatim
text without the normal SGML parsing. Both can be
used in the same way in XML, but neither is used much.
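A CDATA block is easiest to understand next to the escaped alternative. In this Python sketch (the code element and its content are made up), both spellings give the reader exactly the same text.

```python
import xml.etree.ElementTree as ET

# Verbatim text inside a CDATA block, with no escaping needed.
verbatim = ET.fromstring("<code><![CDATA[if (a < b && c) { run(); }]]></code>")

# The same content written with character entities instead.
escaped = ET.fromstring("<code>if (a &lt; b &amp;&amp; c) { run(); }</code>")

print(verbatim.text)
print(verbatim.text == escaped.text)
```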
Several vocabularies might declare a tag with the same name but with different semantics and properties. To avoid confusion, each vocabulary can be declared in a name space of its own.
Specifying a name space
Each namespace is identified by a URI. In the simplest
case, a tag identifies the name space for itself and everything it
contains with the xmlns attribute.
In the following example, both the tags foo and bar are inside the name space http://my.domain/name.
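A minimal sketch of such a document, parsed with Python's xml.etree.ElementTree, could look like this. ElementTree reports each tag name with its name space URI in curly braces in front, which makes it easy to see where an element belongs.

```python
import xml.etree.ElementTree as ET

# The xmlns attribute on foo covers foo itself and everything inside it.
root = ET.fromstring('<foo xmlns="http://my.domain/name"><bar/></foo>')

print(root.tag)
print(root[0].tag)
```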
Often, a document contains a mix of a few name spaces. It is of
course possible to use the above syntax, with an
xmlns attribute for
each element that is from a different namespace than its parent
element. However, that soon becomes extremely verbose. Then it is
more practical to declare a shortcut for each namespace and use that
on the elements.
Hello world ;
Note that when using this syntax, an unqualified name does not belong to the same namespace as that of its parent, but instead to the current default namespace (i.e. the namespace that is defined without a shortcut). The following example is equivalent to the previous one; the location element belongs to the http://my.domain/name namespace.
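Here is a sketch of that rule in Python. The element names greeting and em, the shortcut h, and the second URI http://my.domain/other are all hypothetical, invented for this illustration. The location element sits inside an element from the other namespace, but since it has no shortcut it still ends up in the default one.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<greeting xmlns="http://my.domain/name" xmlns:h="http://my.domain/other">'
    '<h:em>Hello <location>world</location></h:em>'
    '</greeting>'
)

# List every element in document order, with its namespace spelled out.
tags = [element.tag for element in doc.iter()]
for tag in tags:
    print(tag)
```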
The name space URI
A namespace is identified by a URI, not by a resource
found by that URI. On the web, for example
http://www.kth.se/index.html may all identify the same resource
(either simply returning the same content, or redirecting to each
other). But if used as namespace identifiers, they would refer to
three separate namespaces.
A program processing the XML data has no need to fetch the actual resource identified by a name space URI, so using a web URI does not require that the processor has network access. The name space URI does not even need to be specified using a known protocol.
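Both points are easy to try out. In the Python sketch below, three hypothetical example.org URIs that a web server might well treat as the same page give three different namespaces, and a URI with a made-up scheme works fine since nothing is ever fetched.

```python
import xml.etree.ElementTree as ET

# Hypothetical URIs; they differ only in ways a web server could ignore.
uris = [
    "http://example.org/ns",
    "http://example.org/ns/",
    "http://EXAMPLE.org/ns",
]
tags = {ET.fromstring(f'<item xmlns="{uri}"/>').tag for uri in uris}
print(len(tags))   # one distinct qualified name per URI

# Not a known protocol, and no network access needed: still a usable name.
odd = ET.fromstring('<item xmlns="made-up-scheme:whatever"/>')
print(odd.tag)
```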
If you do point your web reader at a name space URI anyway, you will usually find a document specifying the semantics and syntax of the vocabulary.
A unique identifier for the element the attribute belongs to. The
value is required to be unique in the file. Note: Some vocabularies use
id without the xml prefix for this.
The (natural) language of the content of the element the attribute belongs to. The value should be a registered language code.
A URI to use as the base when resolving relative URIs in the content of the element the attribute belongs to (regardless of link semantics). The value should be a URI.
Determines which kind of white-space handling should be done for the
content of the element the attribute belongs to. The value should be
default or preserve.
See the section on whitespace.
See the section about name spaces.
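As a sketch of how these attributes show up to a program, the Python snippet below reads them with xml.etree.ElementTree. I am assuming the descriptions above refer to xml:id, xml:lang and xml:base; the doc and p elements, the values, and the example.org URI are made up. ElementTree files everything with the xml prefix under one fixed namespace URI.

```python
import xml.etree.ElementTree as ET

# The reserved xml: prefix always maps to this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

doc = ET.fromstring(
    '<doc xml:lang="en" xml:base="http://example.org/docs/">'
    '<p xml:id="intro" xml:lang="sv">Hej</p>'
    '</doc>'
)
p = doc[0]

print(doc.get(XML_NS + "lang"), doc.get(XML_NS + "base"))
print(p.get(XML_NS + "id"), p.get(XML_NS + "lang"))
```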
Whether an xml file is correct or incorrect is a question that can be asked on several levels. On the most semantic level, whether the information given in the file is correct, correctness obviously isn’t easily validated (those theoretically inclined are referred to Gödel’s incompleteness theorems). Luckily, there are also some correctness criteria that are easily formalized.
The most basic level of correctness is that a document should be well-formed, meaning that the syntactic rules described above should be maintained. This level of correctness can easily be validated, and there are many programs that do, including editors that revalidate automatically as content is typed.
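A well-formedness check needs nothing more than a parser that complains. Here is a minimal Python sketch; is_well_formed is a helper name I made up, and it simply reports whether xml.etree.ElementTree accepts the input.

```python
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return True if the parser accepts the source as a document."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))       # properly nested
print(is_well_formed("<a><b></a></b>"))    # elements overlap partially
print(is_well_formed("<a/><b/>"))          # more than one root element
```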
The second level is that only declared tags are used, and only in
contexts and with contents that are formally allowed. For example, a
<tr> (table row) must not occur anywhere except inside a
<table>. Given a formal description of the tag vocabulary,
this validation can also be easily automated.
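To give a feeling for what such a check involves, here is a hand-rolled Python sketch of the single table-row rule. It is not a real validator and uses no formal description of the vocabulary; misplaced_rows is a made-up helper, and the body and p elements are only there as a backdrop.

```python
import xml.etree.ElementTree as ET

def misplaced_rows(root):
    """Return every tr element whose parent is not a table element."""
    return [
        child
        for parent in root.iter()
        for child in parent
        if child.tag == "tr" and parent.tag != "table"
    ]

good = ET.fromstring("<body><table><tr/></table></body>")
bad = ET.fromstring("<body><p><tr/></p></body>")

print(len(misplaced_rows(good)), len(misplaced_rows(bad)))
```

A real validator does the same kind of thing for every rule in the vocabulary at once, driven by the formal description instead of code written by hand.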
There are three common ways to give such a definition of the tag set.
- DTD is the way inherited from SGML. Important only if you need to use formats or tools that don’t support the other ways.
- W3C Schema is one of the alternatives to replace DTDs.
- RelaxNG is the DTD replacement I prefer.
Now that you’ve read this little introduction, you might want to know
more about XML and how to use it. The authoritative source
is of course the specification, which currently exists in the form of a
base specification and a set of extensions. These specifications are
all included in what I mean when I say
Then, to actually use what you’ve learned, you will need some tools. These are the main ones I use on a regular basis:
- http://www.thaiopensource.com/nxml-mode/ — My favorite xml mode for emacs. Does relax-ng validation continuously as the file is being typed.
- http://xmlsoft.org/XSLT/ — The XSLT C library for GNOME, includes the command line tool xsltproc.
That’s all for now! In the next installment (if any) I might write more on my way of implementing a CMS.