The basic concept of XML is to have plain text information and put markup around parts of it. The parts might be chapters, headlines, and paragraphs, or they might be product descriptions, order numbers, and prices. Or really anything meaningful.
It is often a good idea to keep the markup semantic, specifying what a part of the text is rather than how it is supposed to look (the looks of things can often be specified in a separate stylesheet).
Elements and tags
An element in xml is a start-tag, some content, and an end-tag. The content can contain text and/or other elements. Elements cannot overlap partially.
A tag is wrapped in the characters < and >, and the end tag is marked by a slash before the tag name. A foo element containing the word bar looks like this:
<foo>bar</foo>
If an element has no content, the start-tag and the end-tag can be combined into an empty-element tag, where the slash marking the end of the element is put at the end of that tag. An empty foo element can look like this (note the trailing slash):
<foo/>
Some elements have attributes that specify extra metadata for the content. An abbreviation might for example have its meaning specified:
<abbr title="eXtensible Markup Language">XML</abbr>
An attribute is always specified as a name followed by an equals sign and a quoted value. The quotes must be the plain ASCII double quote #34 (") or single quote #39 ('), not typographic quotes.
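The pieces described in this section can be inspected from a program. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The abbr element from the text: one attribute and some text content.
abbr = ET.fromstring('<abbr title="eXtensible Markup Language">XML</abbr>')
print(abbr.tag)      # the element name
print(abbr.text)     # the text content
print(abbr.attrib)   # the attributes, as a dict

# Single and double quotes around an attribute value are interchangeable.
same = ET.fromstring("<abbr title='eXtensible Markup Language'>XML</abbr>")

# An empty-element tag and a start-tag/end-tag pair with nothing
# between them describe the same empty element.
short = ET.fromstring("<foo/>")
long_form = ET.fromstring("<foo></foo>")
```

Both quoting styles give the same attribute dictionary, and both spellings of the empty foo element give an element without text or children.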
The document tree
An xml document must consist of exactly one root element and its contents. Since elements can't overlap partially, each element (except the root node) has exactly one parent element. Hence, the entire document can be seen as a tree.
Attributes and text nodes are considered separate nodes, and the attributes of an element always appear before the contents of the element. Thus the following simple document:
<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>
... has the following document tree:
element node text
    attribute node lang = en
    text node "Hello, World! I am "
    element node name
        text node "Rasmus"
    text node "."
The document tree is often a good way to think of a document, and should map directly to document semantics, even if some vocabularies (notably HTML) don't enforce such a mapping.
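The tree for the small Hello, World document can also be printed by a program. This is a minimal sketch, assuming Python's standard xml.dom.minidom module; the helper name describe is made up for this illustration.

```python
from xml.dom import minidom

def describe(node, depth=0):
    """Return indented lines describing a node and its descendants."""
    pad = "  " * depth
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append(f"{pad}element node {node.tagName}")
        # The DOM keeps attributes in a separate map on the element;
        # they are listed first here, before the element's contents.
        for name, value in sorted(node.attributes.items()):
            lines.append(f"{pad}  attribute node {name} = {value}")
        for child in node.childNodes:
            lines.extend(describe(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE:
        lines.append(f"{pad}text node {node.data!r}")
    return lines

doc = minidom.parseString(
    '<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>'
)
tree_lines = describe(doc.documentElement)
print("\n".join(tree_lines))
```

Note that the text before the name element, the text inside it, and the final period each end up as a text node of their own.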
There are two different modes for how whitespace is handled, default or preserve.
When preserving whitespace, each whitespace character (plain space, tab, newline, carriage return) is handled as just that character.
In default mode, the application's normal handling applies; typically each sequence of consecutive whitespace, including space, horizontal tab, newline and carriage return, is treated as equivalent to a single space.
There is a special attribute, xml:space, that may be used to specify the whitespace mode for the content of an element.
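A parser normally hands over the whitespace as written, and it is up to the application to act on the mode. The following is a simplified sketch of that, assuming Python's standard xml.etree.ElementTree module. The helper text_of is hypothetical, only looks at the element itself (not at values inherited from ancestors), and also trims leading and trailing whitespace in default mode.

```python
import xml.etree.ElementTree as ET

# ElementTree reports attributes with the xml: prefix under this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def text_of(elem):
    """Return the element's text, collapsing whitespace unless preserved."""
    raw = elem.text or ""
    if elem.get(f"{{{XML_NS}}}space") == "preserve":
        return raw
    # Default mode in this sketch: runs of whitespace become one space.
    return " ".join(raw.split())

root = ET.fromstring(
    '<r><a>  one \n two </a><b xml:space="preserve">  one \n two </b></r>'
)
collapsed = text_of(root[0])
preserved = text_of(root[1])
```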
Characters and Entities
An xml file can be written using any character encoding, as long as it tells the reader which one. That is done on the first line of the file, which looks like this:
<?xml version="1.0" encoding="utf-8"?>
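Both the version and the encoding have default values. For version, 1.1 is also specified, but 1.0 is the default and is assumed when there is no declaration at all; when a declaration is present, it must state the version. For encoding, utf-8 is the default (utf-16 is also detected automatically), but any encoding the parser supports can be used. So the above is equivalent to:
<?xml version="1.0"?>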
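To see the encoding declaration at work, the same text can be stored in two different encodings and still be read back identically. This is a small sketch, assuming Python's standard xml.etree.ElementTree module; the sample word is arbitrary.

```python
import xml.etree.ElementTree as ET

# The same document as bytes in two encodings. The first one must
# declare its encoding; the second relies on the UTF-8 default.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")
utf8_bytes = "<p>caf\u00e9</p>".encode("utf-8")

latin1_text = ET.fromstring(latin1_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text
print(latin1_text == utf8_text)
```

The bytes on disk differ, but the parsed text is the same Unicode string in both cases.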
Some characters have special meaning in XML. What if I want to express the mathematical relation x < 2? If I write it just like that, the < will mark the beginning of a tag, and not the less-than character. Luckily, there is something called a character entity for this. Using that, I can write x &lt; 2 and it will mean x < 2 (there is also a specific vocabulary for mathematical statements, called MathML, but that is way beyond the scope of this introduction).
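& always marks the start of an entity reference (except in a comment, a processing instruction, or a CDATA block). It is followed by a name and a semicolon. XML defines a few character entities by default: &lt; (<), &gt; (>), &amp; (&), &quot; (") and &apos; (').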
There are also numerical character entities, which refer to Unicode characters (regardless of the document encoding). They can be expressed as &#NNN; in decimal or &#xNNN; in hexadecimal. For example, &#65; is the letter A and &#x3b1; is the Greek letter α.
It is also possible to declare arbitrary named entities, but that requires using a DTD (see below) and isn't used much today, except in legacy HTML documents.
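A parser resolves these references while reading, and a program that writes XML has to go the other way. The sketch below assumes Python's standard xml.etree.ElementTree module and the escape helper from xml.sax.saxutils; the sample strings are arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: named and numerical references are replaced by the parser.
elem = ET.fromstring("<m>x &lt; 2, &#65; and &#x3b1;</m>")
decoded = elem.text

# Writing: escape() replaces &, < and > with their entity references.
escaped = escape("x < 2 & y > 3")
roundtrip = ET.fromstring(f"<m>{escaped}</m>").text

print(decoded)
print(escaped)
```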
A comment in XML starts with <!-- and ends with -->. It can contain anything except two sequential dashes, including newlines and commented-out markup. Entities are not recognized inside a comment:
<!-- This is a comment (which may contain < & >) -->
Comments can't be nested, since a comment must not contain the sequence --. A common workaround for that is to put a space between the dashes in the inner comment:
<!-- a comment <!- - with a "nested" comment - -> end of comment: -->
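The rule about double dashes can be tried out with a parser. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which uses the expat parser); the helper name parses and the sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Markup characters and markup itself are fine inside a comment.
plain = "<r><!-- may contain < & > and <b>markup</b> --></r>"
# A real nested comment puts "--" inside the outer comment.
nested = "<r><!-- outer <!-- inner --> still outer --></r>"
# The workaround: a space between the dashes of the inner comment.
workaround = "<r><!-- outer <!- - inner - -> still outer --></r>"

for name, text in [("plain", plain), ("nested", nested), ("workaround", workaround)]:
    print(name, parses(text))
```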
A processing instruction is a special tag that tells a program reading the file what to do with it. It might for example tell it to use a specific XSLT transform when presenting the document to a user.
<?name and stuff?>
A processing instruction, just like a tag, starts with a name. But after that the syntax is freer, allowing anything but ?> (which terminates the processing instruction). The detailed syntax of a processing instruction can be specific to the toolchain that processes it.
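A program sees a processing instruction as a name (the target) and an otherwise uninterpreted string. The following sketch assumes Python's standard xml.dom.minidom module and uses the xml-stylesheet instruction as an example; the file name style.xsl is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<page/>'
)

# Collect (target, data) for every processing instruction that
# appears at the top level of the document, before or after the root.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The part after the name looks like attributes here, but the parser hands it over as one plain string; interpreting it is left to the tool that understands the instruction. The XML declaration on the first line is not reported as a processing instruction.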
Doctype and cdata
These are SGML legacy. A DOCTYPE declaration was used to tell the parser which SGML vocabulary to expect, by linking to a DTD. A CDATA block was used for a block of verbatim text without the normal SGML parsing. Both can be used in the same way in XML, but neither is used much.
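For the CDATA case, a parser delivers the same text whether it was written as a verbatim block or with entity references. A small sketch, assuming Python's standard xml.etree.ElementTree module; the code element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same content, once as a verbatim CDATA block and once escaped.
with_cdata = ET.fromstring(
    "<code><![CDATA[if (x < 2 && y > 3) { run(); }]]></code>"
)
with_entities = ET.fromstring(
    "<code>if (x &lt; 2 &amp;&amp; y &gt; 3) { run(); }</code>"
)
print(with_cdata.text == with_entities.text)
```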
Several vocabularies might declare a tag with the same name but with different semantics and properties. To avoid confusion, each vocabulary can be declared in a name space of its own.
Specifying a name space
Each namespace is identified by a URI. In the simplest case, a tag identifies the name space for itself and everything it contains with the xmlns attribute.
In the following example, both the tags foo and bar are inside the name space http://my.domain/name.
<?xml version="1.0"?>
<foo xmlns="http://my.domain/name">
  <bar>hello</bar>
</foo>
Often, a document contains a mix of a few name spaces. It is of course possible to use the above syntax, with an xmlns attribute for each element that is from a different namespace than its parent element. However, that soon becomes extremely verbose. Then it is more practical to declare a shortcut for each namespace and use that on the elements.
<x:foo xmlns:x="http://my.domain/name" xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <x:location>world</x:location>
  </y:exclamation>
</x:foo>
Note that when using this syntax, an unqualified name does not belong to the same namespace as that of its parent, but instead to the current default namespace (i.e. the namespace that is defined without a shortcut). The following example is equivalent to the previous one; the location element belongs to the http://my.domain/name namespace.
<foo xmlns="http://my.domain/name" xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <location>world</location>
  </y:exclamation>
</foo>
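That the two spellings are equivalent can be checked by letting a parser resolve the names. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which reports each element name as {namespace URI}local-name; the helper qualified_names is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The same document written with shortcuts for both namespaces...
prefixed = """<x:foo xmlns:x="http://my.domain/name" xmlns:y="http://some.other/markup">
  <y:exclamation>Hello <x:location>world</x:location></y:exclamation>
</x:foo>"""

# ...and with http://my.domain/name as the default namespace.
defaulted = """<foo xmlns="http://my.domain/name" xmlns:y="http://some.other/markup">
  <y:exclamation>Hello <location>world</location></y:exclamation>
</foo>"""

def qualified_names(xml_text):
    """Return the namespace-qualified name of every element, in document order."""
    return [elem.tag for elem in ET.fromstring(xml_text).iter()]

for name in qualified_names(prefixed):
    print(name)
print(qualified_names(prefixed) == qualified_names(defaulted))
```

The shortcuts themselves (x and y) do not show up in the result; only the URIs they stand for matter.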
The name space URI
A namespace is identified by a URI, not by a resource found by that URI. On the web, several slightly different URIs (http://www.kth.se/index.html being one example) may all identify the same resource (either simply returning the same content, or redirecting to each other). But if used as namespace identifiers, each of them would refer to a separate namespace.
A program processing the XML data has no need to fetch the actual resource identified by a name space URI, so using a web URI does not require that the processor has network access. The name space URI does not even need to be specified using a known protocol.
If you do point your web browser at a name space URI anyway, you will usually find a document specifying the semantics and syntax of the vocabulary.
xml:id
A unique identifier for the element the attribute belongs to. The value is required to be unique in the file. Note: Some vocabularies use id without the xml prefix for this.
xml:lang
The (natural) language of the content of the element the attribute belongs to. The value should be a registered language code.
xml:base
A URI to use as the base when resolving relative URIs in the content of the element the attribute belongs to (regardless of link semantics). The value should be a URI.
xml:space
Determines which kind of whitespace handling should be done for the content of the element the attribute belongs to. The value should be default or preserve. See the discussion of whitespace handling above.
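Since these attributes describe the content of an element, a value such as the language carries over to child elements until one of them overrides it. The following is a sketch of how a program might track that, assuming Python's standard xml.etree.ElementTree module; the helper languages and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the predefined xml namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(elem, inherited=None):
    """Yield (tag, language) for elem and its descendants.

    An element without its own xml:lang gets the language of its parent.
    """
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)

doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="sv">Hej</p></doc>'
)
result = list(languages(doc))
print(result)
```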
Whether an XML file is correct or incorrect is a question that can be asked on several levels. On the most semantic level (whether the information given in the file is correct), correctness obviously isn't easily validated (those theoretically inclined are referred to Gödel's incompleteness theorems). Luckily, there are also some correctness criteria that are easily formalized.
The most basic level of correctness is that a document should be well-formed, meaning that the syntactic rules described above should be maintained. This level of correctness can easily be validated, and there are many programs that do so, including editors that revalidate automatically as content is typed.
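Any XML parser has to perform this check, so a well-formedness test takes only a few lines. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name and the sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None

samples = {
    "ok": "<a><b>text</b></a>",
    "overlap": "<a><b>text</a></b>",      # elements overlap partially
    "unquoted attribute": "<a b=c/>",     # attribute value is not quoted
    "two roots": "<a/><b/>",              # more than one root element
}

for name, text in samples.items():
    print(name, "->", well_formedness_error(text) or "well-formed")
```

The error messages come from the underlying parser and include the line and column where it gave up.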
The second level is that only declared tags are used, and only in contexts and with contents that are formally allowed. For example, in XHTML, a <tr> (table row) may not occur anywhere except inside a <table>. Given a formal description of the tag vocabulary, this validation can also be easily automated.
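To give a feel for what such a check involves, here is the table row rule written out by hand. This is only an illustration of one single rule, not a schema validator: it assumes Python's standard xml.etree.ElementTree module and element names without a namespace, and the helper misplaced_rows is hypothetical. A real validator would derive all such rules from the formal description of the vocabulary.

```python
import xml.etree.ElementTree as ET

def misplaced_rows(root):
    """Return every tr element that has no table element among its ancestors."""
    bad = []

    def walk(elem, inside_table):
        if elem.tag == "tr" and not inside_table:
            bad.append(elem)
        for child in elem:
            walk(child, inside_table or elem.tag == "table")

    walk(root, False)
    return bad

doc = ET.fromstring(
    '<body>'
    '<table><tbody><tr id="fine"/></tbody></table>'
    '<tr id="stray"/>'
    '</body>'
)
stray_ids = [row.get("id") for row in misplaced_rows(doc)]
print(stray_ids)
```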
There are three common ways to give such a definition of the tag set.
A DTD is the way inherited from SGML. Important only if you need to use formats or tools that don't support the other ways.
W3C Schema is one of the alternatives to replace DTDs.
RelaxNG is the DTD replacement I prefer.
Now that you've read this little introduction, you might want to know more about XML and how to use it. The authoritative source is of course the specification, which currently exists in the form of a base specification and a set of extensions. These specifications are all included in what I mean when I say XML.
Then, to actually use what you've learned, you will need some tools. These are the main ones I use on a regular basis:
http://www.thaiopensource.com/nxml-mode/ - My favorite XML mode for Emacs. Does RELAX NG validation continuously as the file is being typed.
http://xmlsoft.org/XSLT/ - The XSLT C library for GNOME, includes the command line tool xsltproc.
That's all for now! In the next installment (if any) I might write more on my way of implementing a CMS.