Getting started with XProc

What is it?

XProc is a language for specifying pipelines of transformations on XML documents. It is a W3C specification. XProc lets the developer tie XML transformations together. At the core of XProc is the concept of a step. Developers create compound steps by linking together other steps. Most common XML processing operations already exist as built-in steps:

  • validation
  • XSLT
  • splitting
  • adding namespaces
  • filtering
  • adding UUID attributes
  • HTTP requests
  • XInclude processing

There are a few implementations of XProc at this time; the most common is Calabash. In addition to being available as a download, Calabash is built into the <oXygen/> XML editor.

The first step

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="pipeline">
  <p:input port="source"/>
  <p:output port="result">
    <p:pipe port="result" step="output-input"/>
  </p:output>
  <p:identity name="output-input">
    <p:input port="source">
      <p:pipe port="source" step="pipeline"/>
    </p:input>
  </p:identity>
</p:declare-step>

This pipeline takes an input document and outputs it unchanged. It is not terribly useful, but it is the simplest pipeline we can build.

p:declare-step

All the steps[1] you create are encapsulated by p:declare-step elements. It is the root element and the container for any steps you define.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" 
  version="1.0" name="pipeline">
    … 
</p:declare-step>

We declare the XProc namespace and the XProc version, and we name our step. Steps do not always need to be named, but life will be simpler if you name them: a named step can be referred to, and any error messages you receive will be considerably easier to understand.

p:input

A p:input element defines an input port. You can think of a port as a location to which something can be connected. Content is supplied to a step via an input port. Ports are named using the port attribute. The port below is not connected to anything. The most common place to use an unconnected port is at the top level of a script, because the input comes from the outside.

<p:input port="source"/>

Input ports can be connected to things

This port is connected to an external document using p:document:

<p:input port="source">
  <p:document href="input.xml"/>
</p:input>

This port is connected to an inline document using p:inline:

<p:input port="stylesheet">
  <p:inline>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>
  </p:inline>
</p:input>
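
To show where a stylesheet port like this one lives, here is a minimal sketch of a complete p:xslt step. It assumes an enclosing step named pipeline with a source port, as in the first example. The file name identity.xsl is a placeholder, and the empty parameters binding is an assumption made only so the fragment is complete.

```xml
<p:xslt name="transform">
  <!-- the document to transform: here, the input of the enclosing step -->
  <p:input port="source">
    <p:pipe port="source" step="pipeline"/>
  </p:input>
  <!-- the stylesheet, read from an external file (placeholder name) -->
  <p:input port="stylesheet">
    <p:document href="identity.xsl"/>
  </p:input>
  <!-- no stylesheet parameters are passed in this sketch -->
  <p:input port="parameters">
    <p:empty/>
  </p:input>
</p:xslt>
```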

Most importantly, we can connect steps together using the p:pipe element. This allows us to connect the output of one step to the input of another so that XML data can flow through:

<p:input port="source">
  <p:pipe port="source" step="pipeline"/>
</p:input>

Our input here is connected to another step. If you look back at the first step, you can see that the input for the identity step is connected to the input of the whole script. This is what you will normally find yourself doing: the first step in your script needs to process the input to the entire script.

When we use p:pipe to refer to a step we use the port attribute to reference the input or output port name and the step attribute to reference the step itself.
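
As an illustration of the port and step attributes, the following sketch chains two steps with explicit pipes. The step names copy-input and count-docs are made up for this example. p:count is a standard step that emits a small c:result document containing the number of documents it received.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="pipeline">
  <p:input port="source"/>
  <!-- the script's result is whatever count-docs produces -->
  <p:output port="result">
    <p:pipe port="result" step="count-docs"/>
  </p:output>

  <!-- reads the "source" port of the enclosing step named "pipeline" -->
  <p:identity name="copy-input">
    <p:input port="source">
      <p:pipe port="source" step="pipeline"/>
    </p:input>
  </p:identity>

  <!-- reads the "result" port of the step named "copy-input" -->
  <p:count name="count-docs">
    <p:input port="source">
      <p:pipe port="result" step="copy-input"/>
    </p:input>
  </p:count>
</p:declare-step>
```

In each p:pipe, step names the step being read from and port names the port on that step, whether it is an input of the enclosing step or an output of a sibling.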

p:output

p:output defines an output port. The results of steps are accessed through output ports. Output ports can be set up in much the same way as input ports. However, static documents and inline documents are generally not much use on output. Normally we would want to process the output of a computation.

<p:output port="result">
  <p:pipe port="result" step="output-input"/>
</p:output>

Running the script

Calabash is written in Java, so we can run it using the java command. If you frequently run XProc scripts from the command prompt or shell, you are probably best off with a batch file or shell script to wrap up the complexity.

c:\samples> java -jar calabash.jar --input source=test.xml ex01.xpl
c:\samples> java -cp calabash.jar com.xmlcalabash.drivers.Main --input source=test.xml ex01.xpl

Calabash maps your script’s inputs to command line arguments using the --input argument. Note that this needs to come before the script name in the list of arguments.

We haven’t bound the result port to anything on the command line, so the result is sent to standard output. To write it to a file instead, we can use the --output argument:

c:\samples> java -jar calabash.jar --input source=test.xml --output result=test-out.xml ex01.xpl

The value of each argument is the name of the port, an equals sign, and the URI of the input or output document. Relative URIs are resolved against the current working directory when you run the script.

Doing something more useful

In order to achieve anything particularly useful in XProc we need to tie steps together. Consider a process where we need to read multiple XML files and create a single output file by wrapping all inputs with a wrapper element.

We could achieve that in XSLT without a great deal of effort but it’s a nice sample that we can build on:

Example 1
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="pipeline">
	
  <p:input port="source" sequence="true"/>
	
  <p:output port="result">
    <p:pipe port="result" step="wrap-inputs"/>
  </p:output>
	
  <p:wrap-sequence name="wrap-inputs" wrapper="wrapper">
    <p:input port="source">
      <p:pipe port="source" step="pipeline"/>
    </p:input>
  </p:wrap-sequence>
	
</p:declare-step>

This step adds two new features. We have modified the definition of the input port to be:

<p:input port="source" sequence="true"/>

The sequence attribute states the port accepts a sequence of zero or more XML documents rather than a single document.

We’ve also added a p:wrap-sequence step. This takes a sequence as its input, and creates a single document made up of the inputs wrapped with the specified element (wrapper in this case).
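
A sequence does not have to come from outside the script. As a sketch, the same step can be given a sequence statically by listing several p:document elements in one binding. The file names here are placeholders, and a relative href is resolved against the location of the pipeline document rather than the working directory.

```xml
<p:wrap-sequence name="wrap-inputs" wrapper="wrapper">
  <!-- two documents bound to one port form a sequence of two -->
  <p:input port="source">
    <p:document href="test01.xml"/>
    <p:document href="test02.xml"/>
  </p:input>
</p:wrap-sequence>
```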

We could run this script from the command line as follows:

c:\samples> java -jar calabash.jar --input source=test01.xml --input source=test02.xml ex01.xpl
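
To make the effect concrete, suppose the two input files hold the hypothetical content shown below. The wrapped result should then look like the last document, with the inputs appearing in the order they were supplied. Serialization details such as the XML declaration or indentation may differ between processors.

```xml
<!-- test01.xml (hypothetical content) -->
<doc>one</doc>

<!-- test02.xml (hypothetical content) -->
<doc>two</doc>

<!-- result: both inputs wrapped in a single wrapper element -->
<wrapper><doc>one</doc><doc>two</doc></wrapper>
```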

Stepping back

p:pipeline vs p:declare-step

For simple scripts we can use a shorthand for p:declare-step: p:pipeline. p:pipeline creates a step with certain defaults:

<p:pipeline name="my-pipeline">
  <!-- some steps -->
</p:pipeline>

It is exactly the same as:

<p:declare-step name="my-pipeline">
  <p:input port='source' primary='true'/>
  <p:input port='parameters' kind='parameter' primary='true'/>
  <p:output port='result' primary='true'/>
  <!-- some steps -->
</p:declare-step>
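
As a sketch of how much the shorthand saves, the identity pipeline from the first step could be written as follows. It relies on the default source and result ports that p:pipeline declares, and on primary ports being connected automatically, which the sections on primary inputs and outputs cover.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="pipeline">
  <!-- no explicit ports or pipes: the defaults connect
       source -> p:identity -> result -->
  <p:identity/>
</p:pipeline>
```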

We’ll discuss parameter ports later but we need to talk about input and output ports now.

Ports, inputs and outputs

XProc is all about ports and pipes. Steps have ports and we connect them together with pipes. An input port defines a location to which XML content can be sent for a step to consume. An output port defines a location from which XML documents produced by a step can be retrieved. XProc requires that all primary input and output ports be connected to something.

Pipes provide the connections between steps along which XML documents flow. A pipe allows a port on one step to be connected to a port on another step. Much of the time we will connect the output of one step to the input of another. Sometimes, when we are creating steps, we will connect an input of a parent step to an input of one of its children.

Primary inputs

So far, we’ve made all the connections between our steps quite explicit. However, connections can also be implicit, and the rules are not always obvious. Problems tend to arise when we start dealing with default and primary inputs and outputs.

Steps can have multiple inputs or just one. You can declare one input port to be the primary input for a step:

<p:declare-step name="my-step">
  <p:input port="source" primary="true"/>
  <!-- some steps -->
</p:declare-step>

Additionally, if a step has only one input, that input is automatically promoted to become the primary input of the step. If a step has multiple inputs and none of them is marked as primary, then there is no primary input for that step. Finally, if you have only one input and you explicitly mark it as not being primary (primary="false"), then the step has no primary input.
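
The following sketch shows a step with two inputs, only one of which is primary. The port name extra and the choice of p:insert are assumptions made for illustration. The primary source port is picked up by p:insert without an explicit pipe, while the non-primary extra port has to be wired in by hand.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="my-step">
  <p:input port="source" primary="true"/>
  <!-- a second input; with "source" marked primary, this one is not -->
  <p:input port="extra"/>
  <p:output port="result"/>

  <!-- appends the "extra" document as the last child of the root element -->
  <p:insert match="/*" position="last-child">
    <p:input port="insertion">
      <p:pipe port="extra" step="my-step"/>
    </p:input>
  </p:insert>
</p:declare-step>
```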

Primary outputs

Primary outputs are very similar to primary inputs. Again, if you have more than one output you can indicate which is primary (or have none). If there is only one output and it is not explicitly marked as non-primary (primary="false"), it becomes the primary output for the step.

<p:declare-step name="my-step">
  <p:output port="result" primary="true"/>
  <!-- some steps -->
</p:declare-step>
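
For comparison, here is a sketch of a step with two outputs where one is explicitly marked as primary. The port name total and the step names are made up for this example. It wraps the input sequence on the primary result port and reports the number of inputs on the secondary port.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="my-step">
  <p:input port="source" sequence="true"/>
  <!-- primary output: the wrapped documents -->
  <p:output port="result" primary="true">
    <p:pipe port="result" step="wrap-inputs"/>
  </p:output>
  <!-- secondary output: how many documents were received -->
  <p:output port="total">
    <p:pipe port="result" step="count-inputs"/>
  </p:output>

  <p:wrap-sequence name="wrap-inputs" wrapper="wrapper">
    <p:input port="source">
      <p:pipe port="source" step="my-step"/>
    </p:input>
  </p:wrap-sequence>

  <p:count name="count-inputs">
    <p:input port="source">
      <p:pipe port="source" step="my-step"/>
    </p:input>
  </p:count>
</p:declare-step>
```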

What do primary inputs and outputs do?

A primary port has two implications:

  1. Primary inputs and outputs must be connected to something. It’s a static error (one reported before the script starts running) if a primary port isn’t connected to anything.
  2. Primary ports are automatically connected. Simply put, if a step has an unconnected primary output and the immediately following step has an unconnected primary input, they will be automatically connected:

<p:xslt name='xsl-step'>
  …
</p:xslt>

<p:add-attribute name='my-add'>
  …
</p:add-attribute>
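
A complete, minimal sketch of automatic connection might look like the following. The attribute name status, its value, and the step name wrap-result are made up for illustration. No p:pipe appears anywhere: the script's source feeds my-add, my-add feeds wrap-result, and wrap-result feeds the script's result, all through unconnected primary ports.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  version="1.0" name="pipeline">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- primary input defaults to the "source" port of the script -->
  <p:add-attribute name="my-add" match="/*"
    attribute-name="status" attribute-value="processed"/>

  <!-- primary input defaults to the primary output of "my-add";
       its own primary output becomes the script's "result" -->
  <p:wrap-sequence name="wrap-result" wrapper="wrapper"/>
</p:declare-step>
```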

[1] Not quite true, but good enough for right now.