Introduction to XML
Posted 2007-02-08 23:50. Tagged markup, xml.
Here’s a short introduction to XML, as I wasn’t able to find one I liked with a quick google search. Feedback is welcome!
My goal here is to give a more or less complete understanding of XML, but not of any particular application or use.
The basic concept of xml is to have plain text information, and put markup around parts of it. The parts might be chapters, headlines, and paragraphs, or they might be product descriptions, order numbers, and prices. Or really anything meaningful.
It is often a good idea to keep the markup semantic, specifying what a part of the text is rather than how it is supposed to look (the looks of things can often be specified in a separate stylesheet).
Elements and tags
An element in xml is a start-tag, some content, and an end-tag. The content can contain text and/or other elements. Elements cannot overlap partially.
A tag is wrapped in the characters “<” and “>”, and the end-tag is marked by a slash before the tag name. A foo element containing the word bar looks like this:
<foo>bar</foo>
If an element has no content, the start-tag and the end-tag can be combined into an empty-element tag, where the slash marking the end of the element is put at the end of that tag. An empty foo element can look like this (note the trailing slash):
<foo/>
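To see how a parser reads these two forms, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and of this module is mine, for illustration only; any XML parser should behave the same way.

```python
import xml.etree.ElementTree as ET

full = ET.fromstring("<foo>bar</foo>")       # start-tag, content, end-tag
empty = ET.fromstring("<foo/>")              # empty-element tag
spelled_out = ET.fromstring("<foo></foo>")   # same thing, written the long way

print(full.tag, full.text)                   # element name and its text content
print(empty.tag, empty.text)                 # an empty element has no text
# Both spellings of the empty element serialize identically.
print(ET.tostring(empty) == ET.tostring(spelled_out))
```

The empty-element tag is only a shorthand: the parser produces the same element for both spellings.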
Attributes
Some elements have attributes that specify extra metadata for the content. An abbreviation might for example have its meaning specified:
XML
An attribute is always specified as a name followed by an equals sign and a quoted value (the plain ASCII double quote #34 (") or single quote #39 ('), no fancy quotes).
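A small sketch of reading an attribute, again with Python's xml.etree.ElementTree. The element name abbr and the attribute name title are hypothetical, picked only to illustrate the abbreviation case mentioned above.

```python
import xml.etree.ElementTree as ET

# Element and attribute names are hypothetical, chosen for illustration.
double = ET.fromstring('<abbr title="Extensible Markup Language">XML</abbr>')
single = ET.fromstring("<abbr title='Extensible Markup Language'>XML</abbr>")

print(double.text)                      # the content of the element
print(double.get("title"))              # the value of the attribute
print(double.attrib == single.attrib)   # the quote style is not part of the value
```

Which of the two quote characters you use makes no difference to the parsed result.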
The document tree
An xml document must consist of exactly one root element and its contents. Since elements can’t overlap partially, each element (except the root node) has exactly one parent element. Hence, the entire document can be seen as a tree.
Attributes and text nodes are considered separate nodes, and the attributes of an element always appear before the contents of the element. Thus the following simple document:
<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>
… has the following document tree:
- element node text
  - attribute node lang = en
  - text node "Hello, World! I am "
  - element node name
    - text node "Rasmus"
  - text node "."
The document tree is often a good way to think of a document, and should map directly to document semantics, even if some vocabularies (notably HTML) don’t enforce such a mapping.
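The tree view can be made concrete with a short sketch. It uses Python's standard xml.dom.minidom module (my choice, since it exposes attributes and text as separate nodes), and the sample document string is an assumption written to match the tree described above.

```python
from xml.dom import minidom, Node

def describe(node, depth=0):
    """Return one line per node, indented by its depth in the tree."""
    indent = "  " * depth
    lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append(f"{indent}- element node {node.tagName}")
        # Attributes are listed before the contents of the element.
        for name, value in node.attributes.items():
            lines.append(f"{indent}  - attribute node {name} = {value}")
        for child in node.childNodes:
            lines.extend(describe(child, depth + 1))
    elif node.nodeType == Node.TEXT_NODE:
        lines.append(f"{indent}- text node {node.data!r}")
    return lines

# Assumed sample document, reconstructed from the tree in the text.
source = '<text lang="en">Hello, World! I am <name>Rasmus</name>.</text>'
tree = describe(minidom.parseString(source).documentElement)
print("\n".join(tree))
```

Each level of indentation in the output corresponds to one parent-child step in the document tree.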
Text
Whitespace
There are two modes for how whitespace is handled: default and preserve.
When preserving whitespace, each whitespace character (plain space, tab, newline, carriage return) is handled as just that character.
In default mode, the application is free to use its own whitespace handling. Typically, each sequence of consecutive whitespace characters (space, horizontal tab, newline and carriage return) is treated as equivalent to a single space.
There is a special attribute, xml:space, that may be used to specify the whitespace mode for the content of an element.
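The two modes can be sketched in a few lines of Python. Note the assumptions: the parser itself hands over whitespace unchanged, so the collapsing below is an application policy that I wrote for illustration, and the helper name effective_text and the element names are made up.

```python
import re
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_text(elem, inherited="default"):
    """Return the element's leading text, collapsed unless xml:space says preserve."""
    mode = elem.get(XML_SPACE, inherited)
    text = elem.text or ""
    if mode == "preserve":
        return text
    # Collapse runs of space, tab, carriage return and newline to one space.
    return re.sub(r"[ \t\r\n]+", " ", text)

doc = ET.fromstring(
    '<doc><p>a   b\n\tc</p><pre xml:space="preserve">a   b\n\tc</pre></doc>'
)
para, verbatim = doc
print(repr(effective_text(para)))
print(repr(effective_text(verbatim)))
```

A fuller version would pass the mode of the parent down to its children, since xml:space applies to everything inside the element until it is overridden.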
Characters and Entities
An xml file can be written using any character encoding, as long as it tells the reader which one. That is done in the XML declaration on the first line of the file, which looks like this:
<?xml version="1.0" encoding="utf-8"?>
The declaration is optional. If it is left out, the version defaults to 1.0 (only 1.0 and 1.1 are specified) and the encoding is assumed to be utf-8 (or utf-16, if the file starts with a byte order mark). If the declaration is present, it must state the version, but the encoding may still be left out, so the above is equivalent to:
<?xml version="1.0"?>
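As a rough illustration of how the declared encoding matters, the following Python sketch parses the same text from two differently encoded byte strings. The element name and the sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same document stored in two encodings. The first one has to say so.
latin1 = '<?xml version="1.0" encoding="iso-8859-1"?><name>Åsa</name>'.encode("iso-8859-1")
utf8 = "<name>Åsa</name>".encode("utf-8")   # no declaration: utf-8 is assumed

print(latin1 == utf8)                       # the bytes on disk differ
print(ET.fromstring(latin1).text == ET.fromstring(utf8).text)
```

Once parsed, the two documents carry the same characters, whatever the bytes on disk looked like.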
Some characters have special meaning in xml. What if I want to express the mathematical relation x < 2? If I write it just like that, the < will mark the beginning of a tag, and not the less than character. Luckily, there is something called a character entity for this. Using that, I can write x &lt; 2 and it will mean x < 2 (there is also a specific vocabulary for mathematical statements, called MathML, but that is way beyond the scope of this introduction).
The character “&” always marks the start of an entity reference (except in a comment, a processing instruction or a CDATA block). It is followed by a name and a semicolon. XML defines five character entities by default: &lt; (<), &gt; (>), &amp; (&), &apos; (') and &quot; (").
There are also numerical character references, which refer to Unicode characters (regardless of the document encoding). They can be expressed as &#NNN; in decimal or as &#xNNN; in hexadecimal. For example &#65; is the letter A and &#x3B1; is the Greek letter α.
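A quick sketch of entities in both directions, using Python's standard library (xml.etree.ElementTree for reading and xml.sax.saxutils.escape for writing). The element name m is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser turns references back into plain characters.
relation = ET.fromstring("<m>x &lt; 2</m>").text
letters = ET.fromstring("<m>&#65; &#x3B1;</m>").text   # decimal and hexadecimal

# Writing: escape() replaces the characters that would be read as markup.
escaped = escape("x < 2 & y > 1")

print(relation)
print(letters)
print(escaped)
```

When generating XML from a program, escaping every piece of text on the way out is what keeps a stray < or & from breaking the document.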
It is also possible to declare arbitrary named entities, but that requires using a DTD (see below) and isn’t used much today, except in legacy html documents.
Other markup
Comments
A comment in xml starts with <!-- and ends with -->. It can contain anything except two sequential dashes, including newlines and commented out markup. Entities are not recognized within comments.
<!-- This is a comment (which may contain < & >) -->
Comments can’t be nested, since a comment must not contain the string --. A common workaround for that is to put a space between the dashes in the inner comment:
<!-- a comment
<!- - with a "nested" comment - ->
end of comment: -->
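The comment rules can be checked with a small Python sketch (xml.etree.ElementTree again, with made-up element names). It shows that markup characters inside a comment are harmless, while a double dash is not.

```python
import xml.etree.ElementTree as ET

# A comment may contain < and & without any escaping.
doc = ET.fromstring("<p>before<!-- a comment with < & > -->after</p>")
visible = "".join(doc.itertext())   # the comment is not part of the text
print(visible)

# Two sequential dashes inside a comment make the document ill-formed.
try:
    ET.fromstring("<p><!-- a -- b --></p>")
    double_dash_accepted = True
except ET.ParseError:
    double_dash_accepted = False
print(double_dash_accepted)
```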
Processing instructions
A processing instruction is a special tag that tells a program reading the file what to do with it. It might for example tell it to use a specific xslt transform when presenting the document to a user.
A processing instruction, just like a tag, starts with a name. But after that the syntax is freer, allowing anything but the string ?> (which terminates the processing instruction). The detailed syntax of processing instructions can be specific to the toolchain that processes them.
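As an illustration, here is a sketch that reads a processing instruction with Python's xml.dom.minidom. The xml-stylesheet instruction is the usual way to point at a stylesheet. The file name style.xsl and the root element doc are hypothetical.

```python
from xml.dom import minidom, Node

# The stylesheet file name and the root element are made up for the example.
source = '<?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'
pi = minidom.parseString(source).firstChild

print(pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE)
print(pi.target)   # the name at the start of the instruction
print(pi.data)     # everything after the name, as one uninterpreted string
```

Note that the parser does not split the data into attributes. It only looks like attributes by convention, and making sense of it is left to the program the instruction is aimed at.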
Doctype and cdata
These are SGML legacy. A DOCTYPE was used to tell the parser which SGML vocabulary to expect, by linking to a DTD. A CDATA block was used for a block of verbatim text without the normal SGML parsing. Both can be used in the same way in XML, but neither is used much.
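A short Python sketch of what a CDATA block does, compared with escaping the same text by hand. The element name code and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA block, < and & are taken literally.
cdata = ET.fromstring("<code><![CDATA[if (a < b && c) { run(); }]]></code>")
# The same content written with character entities instead.
escaped = ET.fromstring("<code>if (a &lt; b &amp;&amp; c) { run(); }</code>")

print(cdata.text)
print(cdata.text == escaped.text)
```

The two spellings give the same text after parsing, so CDATA is a convenience for the writer rather than a different kind of content.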
Name spaces
Several vocabularies might declare a tag with the same name but with different semantics and properties. To avoid confusion, each vocabulary can be declared in a name space of its own.
Specifying a name space
Each namespace is identified by a URI. In the simplest case, a tag identifies the name space for itself and everything it contains with the xmlns attribute.
In the following example, both the tags foo and bar are inside the name space http://my.domain/name.
<foo xmlns="http://my.domain/name"><bar>hello</bar></foo>
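To see the effect of the xmlns attribute, here is a minimal Python sketch. The document string is my assumption of what such an example looks like, built from the tag names and the URI mentioned above. ElementTree writes qualified names as {namespace}name.

```python
import xml.etree.ElementTree as ET

# Assumed example document: foo declares the namespace, bar inherits it.
root = ET.fromstring('<foo xmlns="http://my.domain/name"><bar>hello</bar></foo>')
child = root[0]

print(root.tag)
print(child.tag)
```

The bar element carries no xmlns attribute of its own, but it still ends up in the namespace declared on foo.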
Often, a document contains a mix of a few name spaces. It is of course possible to use the above syntax, with an xmlns attribute for each element that is from a different namespace than its parent element. However, that soon becomes extremely verbose. Then it is more practical to declare a shortcut (a prefix) for each namespace and use that on the elements.
Hello world ;
Note that when using this syntax, an unqualified name does not belong to the same namespace as that of its parent, but instead to the current default namespace (i.e. the namespace that is defined without a shortcut). The following example is equivalent to the previous one; the location element belongs to the http://my.domain/name namespace.
Hello world
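The equivalence of a prefixed name and an unqualified name in the default namespace can be sketched in Python as follows. Only the location element and the http://my.domain/name URI come from the text above. The greeting element, the prefixes g and n, and the http://example.org/greeting URI are hypothetical.

```python
import xml.etree.ElementTree as ET

# Variant 1: every namespace gets a prefix (a shortcut).
prefixed = ET.fromstring(
    '<g:greeting xmlns:g="http://example.org/greeting" xmlns:n="http://my.domain/name">'
    "Hello <n:location>world</n:location></g:greeting>"
)
# Variant 2: http://my.domain/name is the default namespace instead.
defaulted = ET.fromstring(
    '<g:greeting xmlns:g="http://example.org/greeting" xmlns="http://my.domain/name">'
    "Hello <location>world</location></g:greeting>"
)

print(prefixed.tag, prefixed[0].tag)
print(defaulted.tag, defaulted[0].tag)
```

In the second variant, location does not pick up the namespace of its parent greeting element. It gets the default namespace, which is why both variants parse to the same names.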
The name space URI
A namespace is identified by a URI, not by a resource found by that URI. On the web, for example http://www.kth.se/, https://www.kth.se and http://www.kth.se/index.html may all identify the same resource (either simply returning the same content, or redirecting to each other). But if used as namespace identifiers, they would refer to three separate namespaces.
A program processing the XML data has no need to fetch the actual resource identified by a name space URI, so using a web URI does not require that the processor has network access. The name space URI does not even need to be specified using a known protocol.
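That a namespace URI is compared as a plain string, and never fetched, can be shown with a few lines of Python. The element name doc and the made-up-scheme URI are hypothetical.

```python
import xml.etree.ElementTree as ET

uris = ["http://www.kth.se/", "https://www.kth.se", "http://www.kth.se/index.html"]
# Parse one tiny document per URI and collect the resulting qualified names.
tags = {ET.fromstring(f'<doc xmlns="{uri}"/>').tag for uri in uris}
print(len(tags))   # number of distinct namespaces

# No known protocol is needed, and nothing is looked up over the network.
odd = ET.fromstring('<doc xmlns="made-up-scheme:whatever"/>')
print(odd.tag)
```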
If you do point your web reader at a name space URI anyway, you will usually find a document specifying the semantics and syntax of the vocabulary.
Special attributes
xml:id
A unique identifier for the element the attribute belongs to. The value is required to be unique in the file. Note: Some vocabularies use id without the xml prefix for this.
xml:lang
The (natural) language of the content of the element the attribute belongs to. The value should be a registered language code.
xml:base
A URI to use as the base when resolving relative URIs in the content of the element the attribute belongs to (regardless of link semantics). The value should be a URI.
xml:space
Determines which kind of whitespace handling should be done for the content of the element the attribute belongs to. The value should be either default or preserve.
See the section on whitespace.
xmlns or xmlns:xx
Declares the default namespace, or a shortcut xx for a namespace. See the section on name spaces.
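Reading the special attributes from a program works like reading any other namespaced attribute, since the xml prefix is always bound to a fixed namespace URI. A minimal Python sketch, with made-up element names and values:

```python
import xml.etree.ElementTree as ET

# The xml: prefix never needs to be declared. It is bound to this URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"

doc = ET.fromstring(
    '<doc xml:lang="en"><p xml:id="intro">Hello</p><p xml:lang="sv">Hej</p></doc>'
)
intro, swedish = doc

print(doc.get(XML_LANG))       # language of the document as a whole
print(intro.get(XML_ID))       # identifier of the first paragraph
print(swedish.get(XML_LANG))   # overrides the language for this element
```

ElementTree only reports the attributes. Checking that an xml:id value really is unique in the file is left to the application or to a validator.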
Correctness
Whether an xml file is correct or incorrect is a question that can be asked on several levels. On the most semantic level, whether the information given in the file is correct, correctness obviously isn’t easily validated (those theoretically inclined are referred to Gödel’s incompleteness theorems). Luckily, there are also some correctness criteria that are easily formalized.
The most basic level of correctness is that a document should be well-formed, meaning that the syntactic rules described above should be maintained. This level of correctness can easily be validated, and there are many programs that do, including editors that revalidate automatically as content is typed.
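A well-formedness check is little more than trying to parse the document. Here is a minimal sketch in Python. The helper name is_well_formed is my own, and the sample documents are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<a><b/></a>"))       # properly nested
print(is_well_formed("<a><b></a></b>"))    # elements overlap partially
print(is_well_formed("<a/><b/>"))          # more than one root element
print(is_well_formed("<a x=1/>"))          # attribute value is not quoted
```

This says nothing about whether the tags used are the right ones for a given vocabulary. That is the next level.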
The second level is that only declared tags are used, and only in contexts and with contents that are formally allowed. For example, in HTML a <tr> (table row) must not occur anywhere except inside a <table>. Given a formal description of the tag vocabulary, this validation can also be easily automated.
There are three common ways to give such a definition of the tag set.
- DTD is the way inherited from SGML. Important only if you need to use formats or tools that don’t support the other ways.
- W3C Schema is one of the alternatives to replace DTDs.
- RelaxNG is the DTD replacement I prefer.
Read more
Now that you’ve read this little introduction, you might want to know more about XML and how to use it. The authoritative source is of course the specification, which currently exists in the form of a base specification and a set of extensions. These specifications are all included in what I mean when I say XML:
- http://www.w3.org/TR/xml11
- http://www.w3.org/TR/xml-id/
- http://www.w3.org/TR/xmlbase/
- http://www.w3.org/TR/xml-names11/
Then, to actually use what you’ve learned, you will need some tools. These are the main ones I use on a regular basis:
- http://relaxng.org/
- http://www.thaiopensource.com/nxml-mode/ : My favorite xml mode for emacs. Does relax-ng validation continuously as the file is being typed.
- http://xmlsoft.org/XSLT/ : The XSLT C library for GNOME, includes the command line tool xsltproc.
That’s all for now! In the next installment (if any) I might write more on my way of implementing a CMS.