Xatapult's XML Blog

20/11/2014

Joy of the XML container

Filed under: Uncategorized — xatapult @ 14:35

One of the things you frequently have to do when working with real-world XML content is process multiple related XML documents. An example is processing Office Open XML content (better known as Microsoft Office documents). A single Office Open XML document (like a .docx file) consists of many closely related XML (and other) documents. All are packed inside the .docx (or .xlsx or .pptx or …) file, which is actually a zip file. A simple but surprisingly effective construction for working with this is to wrap all content in a standardized “XML container” format. This greatly simplifies reading, processing and writing.

XML Containers

An XML Container is a simple construction that enables you to process multiple XML documents as one. The principal idea is to wrap all relevant XML documents inside another document (with a separate namespace to avoid name clashes). For example:

<xmlContainer xmlns="http://my/container/namespace/declaration">

  <document source="path/to/original/document/1">
    <!-- Contents of first XML document -->
  </document>
  <document source="path/to/original/document/2">
    <!-- Contents of second XML document -->
  </document>
  … etc.

</xmlContainer>
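To make the wrapping step concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not the implementation described in this post: the function name `build_container` is made up, and the namespace is simply the placeholder from the example above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace taken from the example above (an assumption, not a real standard).
CONTAINER_NS = "http://my/container/namespace/declaration"


def build_container(documents):
    """Wrap already-parsed XML documents in a single container element.

    `documents` maps a source path to the root element of that document.
    Hypothetical helper, for illustration only.
    """
    container = ET.Element(f"{{{CONTAINER_NS}}}xmlContainer")
    for source, root in documents.items():
        # Each original document gets its own <document source="..."> wrapper.
        wrapper = ET.SubElement(
            container, f"{{{CONTAINER_NS}}}document", {"source": source}
        )
        wrapper.append(root)
    return container
```

Because the wrapper elements live in their own namespace, the wrapped documents keep their element names untouched. When serialized with `ET.tostring`, ElementTree writes the container namespace with a generated prefix instead of a default namespace declaration, which is equivalent XML.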

You can also add non-XML documents (or rather: links to non-XML documents), like this:

<xmlContainer xmlns="http://my/container/namespace/declaration">

  <document source="path/to/original/document/1">
    <!-- Contents of first XML document -->
  </document>
  … etc.

  <externalDocument source="path/to/jpg/asset"/>
  … etc.

</xmlContainer>

If you put all your relevant content in a container construction like this, you can easily work with and on the full content. Things that are normally hard to do in languages such as XSLT, for instance checking references to images, become easy because the images are “present” as <externalDocument> entries.
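As a sketch of such a reference check, the following Python function lists references that have no matching <externalDocument> entry. It rests on assumptions that are not part of the format above: that references are stored in an attribute (here called `href`, a hypothetical name) and that their values are written exactly like the `source` paths in the container.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace taken from the examples above (an assumption).
CONTAINER_NS = "http://my/container/namespace/declaration"


def missing_image_references(container, ref_attribute="href"):
    """Return (document source, referenced path) pairs without an <externalDocument>.

    Assumes references and external document sources use identical path strings.
    """
    known = {
        ext.get("source")
        for ext in container.findall(f"{{{CONTAINER_NS}}}externalDocument")
    }
    missing = []
    for doc in container.findall(f"{{{CONTAINER_NS}}}document"):
        for root in doc:  # the wrapped document(s), not the wrapper itself
            for element in root.iter():
                ref = element.get(ref_attribute)
                if ref is not None and ref not in known:
                    missing.append((doc.get("source"), ref))
    return missing
```

Real formats usually need an extra resolution step (relative paths, relationship files), so treat the string comparison here as the simplest possible case.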

To allow writing the content to some location, we can add target information at some point during processing, like this:

<xmlContainer xmlns="http://my/container/namespace/declaration">

  <document source="path/to/original/document/1" target="path/to/target/document/1">
    <!-- Contents of first XML document -->
  </document>
  … etc.

  <externalDocument source="path/to/jpg/asset" target="path/to/target/jpg/asset"/>
  … etc.

</xmlContainer>

Another useful addition is (optional) information for working with zip files, since this is a common situation. For this we add the paths of the source and target zip files to the root element:

<xmlContainer xmlns="http://my/container/namespace/declaration"
  source-zip="path/to/source/zip/file" target-zip="path/to/target/zip/file">

  <document source="path/to/original/document/1" target="path/to/target/document/1">
    <!-- Contents of first XML document -->
  </document>
  … etc.

  <externalDocument source="path/to/jpg/asset" target="path/to/target/jpg/asset"/>
  … etc.

</xmlContainer>
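Adding the target information can be a small, separate processing step. The Python sketch below shows one way to do it; the function `add_targets` and its `to_target` callback are hypothetical names, and the attribute names follow the examples above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace taken from the examples above (an assumption).
CONTAINER_NS = "http://my/container/namespace/declaration"


def add_targets(container, to_target, target_zip=None):
    """Set a target attribute on every container entry that has a source.

    `to_target` is a caller-supplied function mapping a source path to a target path.
    If `target_zip` is given, it is recorded on the root element as target-zip.
    """
    for entry in container:
        source = entry.get("source")
        if source is not None:
            entry.set("target", to_target(source))
    if target_zip is not None:
        container.set("target-zip", target_zip)
    return container
```

A call such as `add_targets(container, lambda s: "out/" + s, target_zip="out.zip")` would prefix every path with `out/` and leave an existing `source-zip` attribute alone.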

Standardization benefits

At this point you might shrug and think: “Yeah, ok, nothing special. Why?” Well, the benefits of this approach do not come from the clever format I suggest here: it’s actually not very clever at all and simple in the extreme. The big benefits come from standardizing the format in your processing chain.

When we adopt some standard format for a container structure, we can write reusable software (write once, use many) that:

  • Reads any zip file or directory structure into a container (including the references to the non-XML documents)
  • Writes a container back out to a zip file or directory structure (again including the non-XML documents, which are copied from their source)
  • Provides libraries for navigating inside this container for specific purposes. For instance, I have an XSLT library for navigating around Office Open XML documents (which is not easy given the indirect and rather complicated nature of all their document-to-document references).

Using this approach I wrote, for instance, software that changes and generates Office Open XML spreadsheets: read the .xlsx zip file into a container, process it using standard XML techniques (XProc, XSLT, XQuery) and write the result back into an .xlsx zip file. That would have been much, much harder without this container technique, because processing an Office Open XML document involves working with many XML documents at the same time. And hey, they are now all inside this container and therefore easily accessible.

What standard?

There might be a public standard out there that does exactly this, but since it is so simple, I created my own. It consists of one tiny XML schema that defines my container format.

Implementation

I implemented the XML container idea with my favorite X technology du jour: XProc. Here is a pipeline fragment that implements reading the contents of a zip file into a container (omitting details):

<pxp:unzip>
  <p:with-option name="href" select="$source-zip"/>
</pxp:unzip>
<!-- -->
<!-- Loop over all contents and try to get it in: -->
<p:for-each>
  <p:iteration-source select="//c:file"/>
  <!-- -->
  <p:variable name="source" select="/*/@name"/>
  <!-- -->
  <p:try>
    <!-- -->
    <!-- Try to get it as XML: -->
    <p:group>
      <pxp:unzip>
        <p:with-option name="href" select="$source-zip"/>
        <p:with-option name="file" select="$source"/>
      </pxp:unzip>
      <p:wrap match="/*" wrapper="document"/>
    </p:group>
    <!-- -->
    <!-- Something went wrong, so this is not XML... Add it as an external document:-->
    <p:catch>
      <p:identity>
        <p:input port="source">
          <p:inline>
            <externalDocument/>
          </p:inline>
        </p:input>
      </p:identity>
    </p:catch>
    <!-- -->
  </p:try>
  <!-- -->
  <!-- Add the reference to the document in the zip: -->
  <p:add-attribute match="/*" attribute-name="source">
    <p:with-option name="attribute-value" select="$source"/>
  </p:add-attribute>
  <!-- -->
</p:for-each>
<!-- -->
<!-- Add root and attributes: -->
<p:wrap-sequence wrapper="xmlContainer"/>
<p:add-attribute attribute-name="source-zip" match="/*">
  <p:with-option name="attribute-value" select="$source-zip"/>
</p:add-attribute>

You can do likewise for operations like directory-to-container or container-to-disk.
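For the container-to-disk direction, here is a rough Python sketch (Python rather than XProc, and not the pipeline code from this post). The function name `container_to_disk` and its two directory parameters are assumptions: embedded documents are written to their target path, and external documents are copied from a source directory.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder namespace taken from the examples above (an assumption).
CONTAINER_NS = "http://my/container/namespace/declaration"


def container_to_disk(container, source_dir, target_dir):
    """Write every container entry below target_dir and return the written paths.

    Embedded documents are serialized; external documents are copied from source_dir.
    Entries without a target attribute are written to their source path.
    """
    source_base, target_base = Path(source_dir), Path(target_dir)
    written = []
    for entry in container:
        is_document = entry.tag == f"{{{CONTAINER_NS}}}document"
        is_external = entry.tag == f"{{{CONTAINER_NS}}}externalDocument"
        if not (is_document or is_external):
            continue  # ignore anything that is not part of the container format
        target = target_base / (entry.get("target") or entry.get("source"))
        target.parent.mkdir(parents=True, exist_ok=True)
        if is_document:
            # Serialize the wrapped document, without the <document> wrapper.
            ET.ElementTree(entry[0]).write(target, encoding="utf-8", xml_declaration=True)
        else:
            shutil.copyfile(source_base / entry.get("source"), target)
        written.append(target)
    return written
```

The sketch trusts the paths in the container; code that handles untrusted input would also want to reject targets that escape `target_dir`.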

I have not (yet?) made this format and software public because it is tied to some very specific libraries and software setup. However, if you need help or would like example code, feel free to contact me.
