Introduction to XML

Posted 2007-02-08 23:50. Tagged markup, xml.

Please note that this post is 17 years old. The information herein may be outdated.

Here’s a short introduction to XML, as I wasn’t able to find one I liked with a quick Google search. Feedback is welcome!

My goal here is to cover a more or less complete understanding of XML, but no particular application or use.

The basic concept of XML is to have plain text information and put markup around parts of it. The parts might be chapters, headlines, and paragraphs, or they might be product descriptions, order numbers, and prices. Or really anything meaningful.

It is often a good idea to keep the markup semantic, specifying what a part of the text is rather than how it is supposed to look (the looks of things can often be specified in a separate stylesheet).

Elements and tags

An element in xml is a start-tag, some content, and an end-tag. The content can contain text and/or other elements. Elements cannot overlap partially.

A tag is wrapped in the characters “<” and “>”, and the end-tag is marked by a slash before the tag name. A foo element containing the word bar looks like this:

<foo>bar</foo>

If an element has no content, the start-tag and the end-tag can be combined into an empty-element tag, where the slash marking the end of the element is put at the end of that tag. An empty foo element can look like this (note the trailing slash):

<foo/>
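To see these rules in action, here is a minimal sketch that feeds the examples above to a parser. It assumes Python and its standard xml.etree.ElementTree module (my choice for illustration, nothing in XML requires it); the a/b element names in the last case are made up.

```python
import xml.etree.ElementTree as ET

# An element: start-tag, content, end-tag.
foo = ET.fromstring("<foo>bar</foo>")
print(foo.tag, repr(foo.text))  # foo 'bar'

# The empty-element tag and an explicit start/end pair give the same result.
empty_short = ET.fromstring("<foo/>")
empty_long = ET.fromstring("<foo></foo>")
print(empty_short.text, empty_long.text)  # None None

# Partially overlapping elements are rejected by the parser.
try:
    ET.fromstring("<a><b></a></b>")
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False
print(overlap_accepted)  # False
```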

Attributes

Some elements have attributes that specify extra metadata for the content. An abbreviation might for example have its meaning specified:

<abbr title="eXtensible Markup Language">XML</abbr>

An attribute is always specified as a name followed by an equals sign and a quoted value (the plain ASCII double quote #34 (") or single quote #39 ('), no fancy quotes).
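As a small sketch of how a parser sees attributes (again assuming Python's standard xml.etree.ElementTree, which is just one possible tool), the abbr example can be read like this:

```python
import xml.etree.ElementTree as ET

abbr = ET.fromstring('<abbr title="eXtensible Markup Language">XML</abbr>')
print(abbr.get("title"))  # eXtensible Markup Language

# Single and double quotes are interchangeable around the value.
same = ET.fromstring("<abbr title='eXtensible Markup Language'>XML</abbr>")
print(same.attrib == abbr.attrib)  # True

# An unquoted value is not well-formed XML.
try:
    ET.fromstring("<abbr title=XML>XML</abbr>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)  # False
```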

The document tree

An xml document must consist of exactly one root element and its contents. Since elements can’t overlap partially, each element (except the root node) has exactly one parent element. Hence, the entire document can be seen as a tree.

Attributes and text nodes are considered separate nodes, and the attributes of an element always appear before the contents of the element. Thus the following simple document:

<text lang="en">Hello, World!  I am <name>Rasmus</name>.</text>

… has the following document tree:

element node text
- attribute node lang = en
- text node Hello, World! I am
- element node name
  - text node Rasmus
- text node .

The document tree is often a good way to think of a document, and should map directly to document semantics, even if some vocabularies (notably HTML) don’t enforce such a mapping.
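The tree listing above can be reproduced with a few lines of code. This is only a sketch, assuming Python's standard xml.dom.minidom module; the dump function is a hypothetical helper written for this example, and it prints the attributes first simply because that is the order used in the listing.

```python
from xml.dom import minidom

def dump(node, depth=0):
    """Return one line per node, indented by depth, in document order."""
    pad = "  " * depth
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append(f"{pad}element node {node.tagName}")
        # List the attributes before the element's content.
        for name, value in node.attributes.items():
            lines.append(f"{pad}  attribute node {name} = {value}")
        for child in node.childNodes:
            lines.extend(dump(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE:
        lines.append(f"{pad}text node {node.data!r}")
    return lines

doc = minidom.parseString(
    '<text lang="en">Hello, World!  I am <name>Rasmus</name>.</text>'
)
tree_lines = dump(doc.documentElement)
print("\n".join(tree_lines))
```

The text nodes are printed with quotes so that the whitespace the parser hands over is visible.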

Text

Whitespace

There are two different modes for how whitespace is handled: default and preserve.

When preserving whitespace, each whitespace (plain space, tab, new line, carriage return) is handled as just that character.

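When in default mode, the application reading the document applies its own default whitespace handling. Typically each sequence of consecutive whitespace, including space, horizontal tab, newline and carriage return, is treated as equivalent to a single whitespace.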

There is a special attribute, xml:space, that may be used to specify the white space mode for the content of an element.
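Here is a rough sketch of what an application might do with the two modes, assuming Python's standard xml.etree.ElementTree. The parser itself hands over every whitespace character; effective_text is a hypothetical helper that only looks at the element's own xml:space attribute (a real application would also honor a value set on an ancestor), and the doc/p/pre element names are made up.

```python
import xml.etree.ElementTree as ET

# The xml prefix is predeclared; ElementTree spells it out as this URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def effective_text(elem):
    """Apply the whitespace mode to the element's own leading text."""
    text = elem.text or ""
    if elem.get(f"{{{XML_NS}}}space") == "preserve":
        return text                # keep every whitespace character
    return " ".join(text.split())  # collapse runs (this also trims the ends)

doc = ET.fromstring(
    "<doc><p>one \t two\n three</p>"
    '<pre xml:space="preserve">one \t two\n three</pre></doc>'
)
collapsed = effective_text(doc[0])
preserved = effective_text(doc[1])
print(repr(collapsed))  # 'one two three'
print(repr(preserved))  # 'one \t two\n three'
```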

Characters and Entities

An xml file can be written using any character encoding, as long as it tells the reader which one. That is done on the first line of the file, which looks like this:

<?xml version="1.0" encoding="utf-8"?>

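Both the version and the encoding have default values (for version, only 1.0 and 1.1 are specified and 1.0 is the default; for encoding, UTF-8 is the default, but any encoding can be used). For an XML 1.0 document in UTF-8 the whole declaration is therefore optional, and the above is equivalent to leaving it out. If the declaration is present, though, the version must be given, so the shortest valid form is:

<?xml version="1.0"?>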
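To illustrate how the declared encoding matters, here is a small sketch assuming Python's standard xml.etree.ElementTree. The same made-up document is stored once as ISO-8859-1 with a declaration and once as UTF-8 without one.

```python
import xml.etree.ElementTree as ET

# Stored as ISO-8859-1, so the declaration has to name the encoding.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>smörgåsbord</p>'
).encode("iso-8859-1")

# Stored as UTF-8, the default, so no declaration is needed.
utf8_bytes = "<p>smörgåsbord</p>".encode("utf-8")

from_latin1 = ET.fromstring(latin1_bytes)
from_utf8 = ET.fromstring(utf8_bytes)
print(from_latin1.text == from_utf8.text)  # True

# ISO-8859-1 bytes without a declaration are read as UTF-8 and rejected.
try:
    ET.fromstring("<p>smörgåsbord</p>".encode("iso-8859-1"))
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)  # False
```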

Some characters have special meaning in XML. What if I want to express the mathematical relation x < 2? If I write it just like that, the < will mark the beginning of a tag, and not the less-than character. Luckily, there is something called a character entity for this. Using that, I can write x &lt; 2 and it will mean x < 2 (there is also a specific vocabulary for mathematical statements, called MathML, but that is way beyond the scope of this introduction).

Predefined character entities in XML.
Entity	Value	Name	Decimal ref
`&amp;`	&	ampersand	`&#38;`
`&lt;`	<	less than	`&#60;`
`&gt;`	>	greater than	`&#62;`
`&quot;`	"	quotation mark	`&#34;`
`&apos;`	'	apostrophe	`&#39;`

The character “&” always marks the start of an entity reference (except in a comment, a processing instruction, or a CDATA section). It is followed by a name and a semicolon. XML defines a few character entities by default.

There are also numeric character references, which refer to Unicode characters (regardless of the document encoding). They can be expressed as &#NNN; in decimal or as &#xNNN; in hexadecimal. For example, &#65; is the letter A and &#x3B1; is the Greek letter α.

It is also possible to declare arbitrary named entities, but that requires using a DTD (see below) and isn’t used much today, except in legacy HTML documents.
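A quick sketch of the predefined entities and numeric references at work, assuming Python's standard xml.etree.ElementTree (the m element name is made up for the example):

```python
import xml.etree.ElementTree as ET

# The parser turns the entity back into the character it stands for.
relation = ET.fromstring("<m>x &lt; 2</m>")
print(relation.text)  # x < 2

# Numeric references name Unicode code points, in decimal or hexadecimal.
letters = ET.fromstring("<m>&#65; &#x3B1;</m>")
print(letters.text)  # A α

# A bare < in text is not well-formed.
try:
    ET.fromstring("<m>x < 2</m>")
    bare_accepted = True
except ET.ParseError:
    bare_accepted = False
print(bare_accepted)  # False

# Writing a document out again puts the escaping back.
elem = ET.Element("m")
elem.text = "x < 2 & y > 1"
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)  # <m>x &lt; 2 &amp; y &gt; 1</m>
```

In practice this means you rarely write the entities by hand when generating XML from a program; the serializer takes care of them.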

Other markup

Comments

A comment in XML starts with <!-- and ends with -->. It can contain anything except two sequential dashes, including newlines and commented out markup. Entities are not recognized within comments.

<!-- This is a comment (which may contain < & >) -->

Comments can’t be nested, since a comment must not contain the string --. A common workaround for that is to put a space between the dashes in the inner comment:

<!-- a comment
<!- - with a "nested" comment - ->
end of comment: -->
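The comment rules can be checked with a short sketch, assuming Python's standard xml.dom.minidom (which, unlike some parsers, keeps comments in the tree). The note element is made up for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    "<note><!-- This is a comment (which may contain < & >) -->text</note>"
)
comment = doc.documentElement.firstChild
print(comment.nodeType == comment.COMMENT_NODE)  # True
print(comment.data)

# Two sequential dashes inside a comment are an error.
try:
    minidom.parseString("<note><!-- a -- b --></note>")
    double_dash_accepted = True
except ExpatError:
    double_dash_accepted = False
print(double_dash_accepted)  # False

# The workaround with a space between the dashes parses fine.
nested = minidom.parseString(
    '<note><!-- a comment\n'
    '<!- - with a "nested" comment - ->\n'
    'end of comment: --></note>'
)
nested_comment = nested.documentElement.firstChild
```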

Processing instructions

A processing instruction is a special tag that tells a program reading the file what to do with it. It might for example tell it to use a specific XSLT transform when presenting the document to a user.

<?name and stuff?>

A processing instruction, just like a tag, starts with a name. But after that the syntax is freer, allowing anything but the string ?> (which terminates the processing instruction). Detailed syntax of processing instructions can be specific to the toolchain that processes them.
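As a sketch of the name-plus-free-form structure, here is how a parser splits a processing instruction, assuming Python's standard xml.dom.minidom. The xml-stylesheet instruction is the common way to point at a stylesheet; the file name style.xsl and the doc element are made up.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'
)
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)  # True
print(pi.target)  # xml-stylesheet
print(pi.data)    # type="text/xsl" href="style.xsl"
```

Note that the parser hands over everything after the name as one string. It looks like attributes here, but interpreting it is up to the program that cares about this particular instruction.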

Doctype and cdata

These are SGML legacy. A DOCTYPE was used to tell the parser which SGML vocabulary to expect, by linking to a DTD. A CDATA block was used for a block of verbatim text without the normal SGML parsing. Both can be used in the same way in XML, but neither is used much.
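For completeness, a small sketch of both, assuming Python's standard xml.etree.ElementTree. The code and doc elements, the sample text, and the greeting entity are all made up for the example.

```python
import xml.etree.ElementTree as ET

# A CDATA block is verbatim text: < and & need no escaping inside it.
with_cdata = ET.fromstring(
    "<code><![CDATA[if (a < b && c) { run(); }]]></code>"
)
with_entities = ET.fromstring(
    "<code>if (a &lt; b &amp;&amp; c) { run(); }</code>"
)
print(with_cdata.text == with_entities.text)  # True

# A DOCTYPE can carry declarations, here a named entity in an inline DTD.
with_doctype = ET.fromstring(
    '<!DOCTYPE doc [<!ENTITY greeting "Hello, World!">]>'
    "<doc>&greeting;</doc>"
)
print(with_doctype.text)  # Hello, World!
```

After parsing, nothing in the tree tells the two code elements apart, which is one reason CDATA is mostly a convenience for hand-written documents.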

Name spaces

Several vocabularies might declare a tag with the same name but with different semantics and properties. To avoid confusion, each vocabulary can be declared in a name space of its own.

Specifying a name space

Each namespace is identified by a URI. In the simplest case, a tag identifies the name space for itself and everything it contains with the xmlns attribute.

In the following example, both the tags foo and bar are inside the name space http://my.domain/name.

<foo xmlns="http://my.domain/name">
  <bar>hello</bar>
</foo>

Often, a document contains a mix of a few name spaces. It is of course possible to use the above syntax, with an xmlns attribute for each element that is from a different namespace than its parent element. However, that soon becomes extremely verbose. Then it is more practical to declare a shortcut for each namespace and use that on the elements.

<x:foo xmlns:x="http://my.domain/name"
       xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <x:location>world</x:location>
  </y:exclamation>
</x:foo>

Note that when using this syntax, an unqualified name does not belong to the same namespace as that of its parent, but instead to the current default namespace (i.e. the namespace that is defined without a shortcut). The following example is equivalent to the previous one: the location element belongs to the http://my.domain/name namespace.

<foo xmlns="http://my.domain/name"
     xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <location>world</location>
  </y:exclamation>
</foo>
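That the two spellings really are equivalent can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree, which reports every element name as {namespace URI}local-name regardless of which shortcut was used. qualified_names is a hypothetical helper for this example.

```python
import xml.etree.ElementTree as ET

prefixed = ET.fromstring("""
<x:foo xmlns:x="http://my.domain/name"
       xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <x:location>world</x:location>
  </y:exclamation>
</x:foo>
""")

defaulted = ET.fromstring("""
<foo xmlns="http://my.domain/name"
     xmlns:y="http://some.other/markup">
  <y:exclamation>
    Hello <location>world</location>
  </y:exclamation>
</foo>
""")

def qualified_names(root):
    """List the namespace-qualified name of every element, in document order."""
    return [elem.tag for elem in root.iter()]

names = qualified_names(prefixed)
print(names)
print(names == qualified_names(defaulted))  # True
```

The shortcuts x and y do not show up in the result at all; only the URIs matter.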

The name space URI

A namespace is identified by a URI, not by a resource found by that URI. On the web, for example, http://www.kth.se/, https://www.kth.se and http://www.kth.se/index.html may all identify the same resource (either simply returning the same content, or redirecting to each other). But if used as namespace identifiers, they would refer to three separate namespaces.

A program processing the XML data has no need to fetch the actual resource identified by a name space URI, so using a web URI does not require that the processor has network access. The name space URI does not even need to be specified using a known protocol.

If you do point your web browser at a name space URI anyway, you will usually find a document specifying the semantics and syntax of the vocabulary.
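A tiny sketch of the point that namespace URIs are compared as plain strings, assuming Python's standard xml.etree.ElementTree. The doc element and the made-up-scheme URI are invented for the example, and nothing here touches the network.

```python
import xml.etree.ElementTree as ET

uris = [
    "http://www.kth.se/",
    "https://www.kth.se",
    "http://www.kth.se/index.html",
]

# The same local name in three namespaces gives three different names.
tags = {ET.fromstring(f'<doc xmlns="{uri}"/>').tag for uri in uris}
print(len(tags))  # 3

# The URI does not need to use a known protocol.
odd = ET.fromstring('<doc xmlns="made-up-scheme:whatever"/>')
print(odd.tag)  # {made-up-scheme:whatever}doc
```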

Special attributes

`xml:id`

A unique identifier for the element the attribute belongs to. The value is required to be unique in the file. Note: Some vocabularies use id without the xml prefix for this.

`xml:lang`

The (natural) language of the content of the element the attribute belongs to. The value should be a registered language code.

`xml:base`

A URI to use as the base when resolving relative URIs in the content of the element the attribute belongs to (regardless of link semantics). The value should be a URI.

`xml:space`

Determines which kind of white-space handling should be done for the content of the element the attribute belongs to. The value should be either default or preserve. See the section on whitespace.

`xmlns` or `xmlns:xx`

See the section about name spaces.
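Here is a sketch of reading the xml: attributes from a program, assuming Python's standard xml.etree.ElementTree and urllib.parse. The xml prefix never needs to be declared; it is always bound to the URI in XML_NS below. The document, the example.org address and chapter1.xml are made up, and note that this particular parser does not check that xml:id values are unique.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The namespace that the predeclared xml prefix stands for.
XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = ET.fromstring(
    '<doc xml:lang="en" xml:base="http://example.org/docs/">'
    '<p xml:id="intro" xml:lang="sv">Hej</p>'
    "</doc>"
)
para = doc[0]

para_id = para.get(f"{{{XML_NS}}}id")
para_lang = para.get(f"{{{XML_NS}}}lang")
doc_lang = doc.get(f"{{{XML_NS}}}lang")
print(para_id, para_lang, doc_lang)  # intro sv en

# Resolve a relative URI against the xml:base of the document element.
resolved = urljoin(doc.get(f"{{{XML_NS}}}base"), "chapter1.xml")
print(resolved)  # http://example.org/docs/chapter1.xml
```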

Correctness

Whether an XML file is correct or incorrect is a question that can be asked on several levels. On the most semantic level, whether the information given in the file is correct, correctness obviously isn’t easily validated (those theoretically inclined are referred to Gödel’s incompleteness theorems). Luckily, there are also some correctness criteria that are easily formalized.

The most basic level of correctness is that a document should be well-formed, meaning that the syntactic rules described above are followed. This level of correctness can easily be checked, and there are many programs that do so, including editors that recheck automatically as content is typed.
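Checking well-formedness from a program is a matter of trying to parse the document. A minimal sketch, assuming Python's standard xml.etree.ElementTree; is_well_formed is a hypothetical helper, and the sample documents are made up:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))   # True
print(is_well_formed("<a><b></a>"))    # False: b is never closed
print(is_well_formed("<a/><b/>"))      # False: more than one root element
print(is_well_formed("<a x=1/>"))      # False: attribute value not quoted
```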

The second level is that only declared tags are used, and only in contexts and with contents that are formally allowed. For example, in HTML a <tr> (table row) must not occur anywhere except inside a <table>. Given a formal description of the tag vocabulary, this validation can also be easily automated.

There are three common ways to give such a definition of the tag set.

DTD is the way inherited from SGML. Important only if you need to use formats or tools that don’t support the other ways.
W3C Schema is one of the alternatives to replace DTDs.
RelaxNG is the DTD replacement I prefer.
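To give a feeling for what this second level of checking does, here is a toy sketch in Python (standard xml.etree.ElementTree only). It is not a DTD, W3C Schema or RelaxNG validator: ALLOWED_PARENTS is a hand-written, hypothetical rule table covering just the table row example above, and violations is a made-up helper. The root element itself is not checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> names allowed as its parent.
ALLOWED_PARENTS = {
    "tr": {"table"},
    "td": {"tr"},
}

def violations(root):
    """List every child element that sits inside a parent it is not allowed in."""
    found = []
    for parent in root.iter():
        for child in parent:
            allowed = ALLOWED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                found.append(f"<{child.tag}> inside <{parent.tag}>")
    return found

good = ET.fromstring("<table><tr><td>1</td></tr></table>")
bad = ET.fromstring("<div><tr><td>1</td></tr></div>")

print(violations(good))  # []
print(violations(bad))   # ['<tr> inside <div>']
```

Both documents are well-formed; only the second one breaks the vocabulary rules. A real schema language expresses such rules declaratively, and covers attributes and content order as well.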

Read more

Now that you’ve read this little introduction, you might want to know more about XML and how to use it. The authoritative source is of course the specification, which currently exists in the form of a base specification and a set of extensions. These specifications are all included in what I mean when I say XML:

Then, to actually use what you’ve learned, you will need some tools. These are the main ones I use on a regular basis:

http://relaxng.org/
http://www.thaiopensource.com/nxml-mode/ - My favorite XML mode for Emacs. Does RelaxNG validation continuously as the file is being typed.
http://xmlsoft.org/XSLT/ - The XSLT C library for GNOME, includes the command line tool xsltproc.

That’s all for now! In the next installment (if any) I might write more on my way of implementing a CMS.

