XML beyond pointy brackets

A primer to get you started. If you're not sure what XML is, or why it might be important to you, read on.

Where is XML used right now?

XML is a ubiquitous technology and you probably use it every day. It is used in or by:
• Microsoft Office file formats
• InDesign
• EPUB files
• Many databases
• ONIX (metadata)
• ... and more

If you want to create, edit, store, transform and publish documents, you need to know a little about XML.

What is XML?

XML is a mark-up language (it stands for Extensible Markup Language). Mark-up is information added to the text of a document to say something about that text. Conventionally, we divide mark-up into three categories:

  • presentational
  • procedural
  • descriptive

Presentational mark-up is an attempt to show the structure of a document via cues in the content (e.g. putting multiple blank lines under the title in a text file). Procedural mark-up places formatting codes into the text of a document. So, the title above may be preceded by codes for bold, centring and font size. Descriptive mark-up (or semantic mark-up) attempts to describe the purpose of portions of documents without describing the formatting of those portions.

For example:
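
As a rough sketch, here is the title of this post marked up in each of the three ways. The bracketed formatting codes and the element names (article, title, para) are invented for illustration; they don't come from any real tag-set. The snippet uses Python's standard xml.etree.ElementTree module to show what the descriptive version buys you.

```python
import xml.etree.ElementTree as ET

# Presentational: structure is only hinted at by layout (blank lines under the title).
presentational = "XML beyond pointy brackets\n\n\nA primer to get you started."

# Procedural: formatting commands sit in the text (made-up codes for this sketch).
procedural = "[bold][centre][size 18]XML beyond pointy brackets[/size][/centre][/bold]"

# Descriptive: the tags say what each part is, not how it should look.
descriptive = (
    "<article>"
    "<title>XML beyond pointy brackets</title>"
    "<para>A primer to get you started.</para>"
    "</article>"
)

# Only the descriptive version lets software ask for "the title" directly.
root = ET.fromstring(descriptive)
title = root.findtext("title")
print(title)
```

With the first two versions, software would have to guess which run of text is the title; with the third it can simply ask.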

What does that have to do with language?

We refer to the sets of tags in procedural and semantic mark-up as tag-sets. We also refer to them as languages. These aren’t natural languages like English. These are artificial languages in the same way as programming languages. In fact, procedural mark-up has a lot in common with programming – it’s intended to be interpreted as a sequence of commands (“make this bold”) by software.

What’s so special about XML?

XML is actually a specification that describes how mark-up languages work. It is a meta-language – it describes other languages. All XML-derived languages share certain features:

  • opening and closing tags using angle brackets (<tag>), with the closing tag indicated by a forward slash (</tag>)
  • attributes within the opening tags, with quoted values (<tag attribute="value">)
  • strict rules on how to create well-formed XML

XML is special because it supplies a set of simple rules for writing mark-up languages and guarantees a degree of interoperability between compliant systems.
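
To make "well-formed" a little more concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the sample strings are assumptions made up for this example. Any conforming XML parser should reject the two broken samples.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = '<book id="b1"><title>Primer</title></book>'
# Tags closed in the wrong order.
bad_nesting = "<book><title>Primer</book></title>"
# Attribute value not wrapped in quotes.
bad_attribute = "<book id=b1><title>Primer</title></book>"

for sample in (good, bad_nesting, bad_attribute):
    print(is_well_formed(sample), sample)
```

This is where the interoperability comes from: every compliant tool agrees on what counts as a parsable document before it looks at what the tags mean.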

A "meta-language" is simply a set of rules that tell you what your mark-up language should look like. It doesn’t tell you what the individual elements are or what attributes they can have or how you can combine those elements.

When an XML language is created, it is given a grammar (just like English but much, much simpler). That grammar describes the structure of the individual language. Such a grammar is called a schema. A schema describes the actual elements, their relationships and their attributes. (I'll write more about schemas in a later post.)
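
The idea of a grammar can be sketched in a few lines of Python. This is a toy stand-in, not a real schema language: in practice you would write a DTD, an XML Schema (XSD) or a RELAX NG schema and let a validating parser do the checking. The recipe tag-set and the grammar_errors helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: which child elements each element may contain.
ALLOWED_CHILDREN = {
    "recipe": {"title", "ingredient", "step"},
    "title": set(),
    "ingredient": set(),
    "step": set(),
}

def grammar_errors(root):
    """List every element that the toy grammar does not permit."""
    errors = []
    if root.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{root.tag}>")
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                errors.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return errors

valid = ET.fromstring(
    "<recipe><title>Toast</title><step>Toast the bread.</step></recipe>"
)
invalid = ET.fromstring(
    "<recipe><title>Toast</title><chapter>One</chapter></recipe>"
)

print(grammar_errors(valid))
print(grammar_errors(invalid))
```

Both documents are well-formed XML. Only the first is valid against this particular grammar, which is the distinction a schema exists to draw.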

Which XML language is right for you?

The primary benefits of XML conversion are realised with material that is likely to be processed to different formats, split and sold or licensed or repackaged. Most suppliers use XML as an export tool to drive down their costs; but if you as a publisher are going to use XML, you need a specification tailored to your content.

There are many kinds of XML and many reasons for creating it. Are you looking to make your content semantically searchable? Is your primary purpose to drive down production costs? Or do you want to be able to remix and repurpose content (recipes, encyclopaedias and so forth)? Are you using your XML content in a DAM/DAD system, in an XML-to-HTML pipeline, or exporting to InDesign or other formatting software?

All of these questions will affect which DTD or schema you use, how detailed your specification should be, and what level of QA you require (automated only, or human-interpreted?). Note that XML is particularly suited to automated QA processes, and it's a wise investment to consider using these tools.
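
As an illustration of what automated QA can look like, here is a minimal sketch of two house-style checks that go beyond what a grammar usually expresses: no two elements may share an id, and every chapter needs a non-empty title. The book/chapter tag-set, the rules and the qa_report helper are all assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

def qa_report(xml_text):
    """Return a list of human-readable problems found in the document."""
    root = ET.fromstring(xml_text)
    problems = []

    # Rule 1: id attributes must be unique across the whole document.
    seen_ids = set()
    for element in root.iter():
        element_id = element.get("id")
        if element_id is not None:
            if element_id in seen_ids:
                problems.append(f"duplicate id '{element_id}'")
            seen_ids.add(element_id)

    # Rule 2: every chapter must have a title with some text in it.
    for chapter in root.iter("chapter"):
        title = chapter.findtext("title", default="").strip()
        if not title:
            problems.append(f"chapter '{chapter.get('id')}' has no title")

    return problems

sample = """
<book>
  <chapter id="c1"><title>Beginnings</title></chapter>
  <chapter id="c1"><title>  </title></chapter>
</book>
"""

for problem in qa_report(sample):
    print(problem)
```

Checks like these can run on every file a supplier delivers, leaving human reviewers to concentrate on the questions only a person can answer.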

Different kinds of XML can be more or less useful for different kinds of content (and this can be complex). The table below suggests how you might want to start thinking about using XML effectively.

Some XML technologies

Learning XML is as simple as writing “<tag>some text</tag>”. The tricky part is learning the tools you may need to use to parse it.

XML was created to structure, store and transport information. An XML document does not do anything by itself: it is just information wrapped in tags, and someone must write a piece of software to send, receive or display it. The associated XML technologies, however, are extremely powerful. One of the design goals of XML is: “It shall be easy to write programs which process XML documents.” The term “XML” is often used to refer to XML together with one or more of these other technologies that have come to be seen as part of the XML core.
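
Here is a small sketch of that design goal in action: a few lines of Python that read a product feed and pull out the fields a bookshop might care about. The catalogue format is made up for this example (it is not ONIX), and list_products is a hypothetical helper; the parsing calls are from Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A made-up product feed, loosely in the spirit of a metadata record.
catalogue = """
<catalogue>
  <product isbn="9780000000001">
    <title>First Book</title>
    <price currency="GBP">9.99</price>
  </product>
  <product isbn="9780000000002">
    <title>Second Book</title>
    <price currency="GBP">12.50</price>
  </product>
</catalogue>
"""

def list_products(xml_text):
    """Return (isbn, title, price, currency) for each product in the feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        price = product.find("price")
        rows.append((
            product.get("isbn"),
            product.findtext("title"),
            float(price.text),
            price.get("currency"),
        ))
    return rows

for row in list_products(catalogue):
    print(row)
```

The document itself stays inert; all of the behaviour lives in the program that reads it.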

Why XML?

Nimble content

XML allows separation of design and content. You want to publish your content on as many platforms as possible, as efficiently as possible. XML makes content agile, and it's the only technology that consistently does this. You also want to make your content searchable (and findable!), and XML allows the semantic mark-up that makes this possible.
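
A minimal sketch of that separation: one piece of content, stored once, rendered two different ways. The recipe tag-set and the to_html and to_text functions are hypothetical. A production pipeline would more likely use XSLT or a templating system than hand-built strings.

```python
import xml.etree.ElementTree as ET
from html import escape

# The content, stored once, with no formatting decisions baked in.
recipe = ET.fromstring(
    "<recipe><title>Tea &amp; toast</title>"
    "<step>Boil the kettle.</step><step>Toast the bread.</step></recipe>"
)

def to_html(root):
    """Render the recipe as an HTML fragment for the web."""
    steps = "".join(f"<li>{escape(s.text)}</li>" for s in root.findall("step"))
    return f"<h1>{escape(root.findtext('title'))}</h1><ol>{steps}</ol>"

def to_text(root):
    """Render the same recipe as plain text, e.g. for an email."""
    lines = [root.findtext("title").upper()]
    lines += [f"{n}. {s.text}" for n, s in enumerate(root.findall("step"), start=1)]
    return "\n".join(lines)

print(to_html(recipe))
print(to_text(recipe))
```

Adding a third output (an EPUB chapter, say) means writing another renderer, not touching the content.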

Archival and long-term file storage

Possibly the best thing about XML is that it's an open, robust technology with an active community of developers and users. File formats and specifications will change, and many tools have fallen by the wayside over the years, but XML is not a proprietary format and it handles its own obsolescence. By storing your content in XML you can make it future-proof.

EPUB, InDesign, Word and PDF are not designed as archival formats (that is, with long-term storage in mind). Publishers, libraries and archives need to consider what file format they are going to use for long-term preservation, both internally and externally. XML fits the bill.