Orbeon Forms User Guide

Non-XML Documents in XPL

1. Introduction

In Orbeon Forms, XPL and pipelines deal only with XML documents: between processor outputs and processor inputs in a pipeline, only pure XML infosets circulate. There is, however, often a need to handle non-XML data in pipelines, in particular:

  • Binary documents: any document that can be represented as a stream of bytes. In principle this is true of any document, but some document formats are almost always represented this way: images, sounds, PDF documents, etc.
  • Text documents: any document that can be represented as a stream of characters. Some documents are better viewed this way, such as plain text files, HTML files, and even the textual representation of XML.

Orbeon Forms addresses this need by defining two standard XML document formats that embed binary and text documents within an XML infoset. This solution keeps XPL simple by limiting it to pure XML infosets, while still allowing XPL to conveniently manipulate any binary or text document.

2. Binary Documents

A binary document consists of a document root element containing character data encoded with Base64. An xsi:type attribute is also present, as well as an optional content-type attribute, for example:

<document xsi:type="xs:base64Binary" content-type="image/jpeg">/9j/4AAQSkZJRgABAQEBygHKAAD/2wBDAAQDAwQDAwQEBAQFBQQFBwsHBwYGBw4KCggLEA4R ... KKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA//2Q==</document>
Note

For the curious, the Base64 encoding is documented in RFC 2045. This encoding represents binary data by mapping it to a set of 64 ASCII characters.

Such documents are not meant to be read by users, in the same way that regular binary files are not meant to be examined by users. Binary documents are generated by Orbeon Forms processors, like the URL generator and converters. They are consumed by processors like the HTTP serializer, the Email processor, and converters.
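As an illustration of the format only, the following minimal Python sketch wraps a byte string in a binary document and reads it back. It is not Orbeon Forms code: the function names wrap_binary and unwrap_binary are hypothetical, and the sketch assumes that the xsi and xs prefixes are bound to the standard XML Schema namespaces, which the example above leaves undeclared.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Optional, Tuple
from xml.sax.saxutils import quoteattr

# Assumption: the prefixes used by the format map to the standard
# XML Schema namespaces. The guide's example omits the declarations.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XS = "http://www.w3.org/2001/XMLSchema"


def wrap_binary(data: bytes, content_type: Optional[str] = None) -> str:
    """Embed raw bytes in a <document> element as Base64 text (hypothetical helper)."""
    attrs = [
        "xmlns:xsi=" + quoteattr(XSI),
        "xmlns:xs=" + quoteattr(XS),
        'xsi:type="xs:base64Binary"',
    ]
    if content_type is not None:
        # content-type is optional in the format
        attrs.append("content-type=" + quoteattr(content_type))
    # Base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=',
    # so it needs no XML escaping.
    payload = base64.b64encode(data).decode("ascii")
    return "<document " + " ".join(attrs) + ">" + payload + "</document>"


def unwrap_binary(xml_text: str) -> Tuple[bytes, Optional[str]]:
    """Return the decoded bytes and the content-type attribute, if any."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.text or ""), root.get("content-type")
```

For instance, wrap_binary(b"\xff\xd8\xff\xe0", "image/jpeg") produces a document whose text starts with /9j/4A, the same prefix as the JPEG example above, and unwrap_binary returns the original four bytes together with the content type.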

3. Text Documents

A text document consists of a document root element containing the text. An xsi:type attribute is also present, as well as an optional content-type attribute:

<document xsi:type="xs:string" content-type="text/plain">This is line one of the input document! This is line two of the input document! This is line three of the input document!</document>

The content-type attribute may have a charset parameter providing a hint for the character encoding, for example:

<document xsi:type="xs:string" content-type="text/plain; charset=iso-8859-1">This is line one of the input document! This is line two of the input document! This is line three of the input document!</document>

Because XML character data itself is represented in Unicode (in other words, it is designed to allow representing in a single document all the characters specified by the Unicode specification), there is no requirement for specifying a character encoding in XML pipelines. However, when an XML infoset is read or written as a textual XML document, specifying a character encoding may be a useful hint. For example, a URL generator can, with this mechanism, communicate to an HTTP serializer the preferred character encoding obtained when the document was read. The serializer may then use that hint, but it is by no means authoritative.

In general, XML documents can be read and written using the utf-8 character encoding, which allows representing all the Unicode characters. However, when dealing with other types of text documents, tools such as text editors may not be able to deal correctly with utf-8. In such cases, it can be useful to use even more widespread character encodings such as iso-8859-1 or us-ascii. The drawback is that such encodings allow representing a much smaller set of characters than utf-8.
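The non-authoritative nature of the charset hint can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the behavior of the actual HTTP serializer: the function names charset_hint and serialize_text are hypothetical, and falling back to utf-8 when the hint is missing, unknown, or cannot represent the text is one possible policy among others.

```python
from email.message import Message
from typing import Optional, Tuple


def charset_hint(content_type: Optional[str], default: str = "utf-8") -> str:
    """Extract the charset parameter of a content-type value, or a default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when there is no charset parameter
    return msg.get_content_charset() or default


def serialize_text(text: str, content_type: Optional[str] = None) -> Tuple[bytes, str]:
    """Encode text using the hinted charset when possible (hypothetical policy).

    The hint is only a preference: if the encoding is unknown, or cannot
    represent every character of the text, fall back to utf-8.
    """
    encoding = charset_hint(content_type)
    try:
        return text.encode(encoding), encoding
    except (LookupError, UnicodeEncodeError):
        return text.encode("utf-8"), "utf-8"
```

With this sketch, text containing only characters available in iso-8859-1 is written in that encoding when the hint asks for it, while text containing a character outside that set, such as the euro sign, is written as utf-8 regardless of the hint.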

Unlike binary documents, text documents can easily be examined by users. They can also be easily manipulated by languages such as XSLT. Like binary documents, they are generated by Orbeon Forms processors, like the URL generator and converters. They are consumed by processors like the HTTP serializer, the Email processor, and converters.

4. Streaming

Processors can stream binary and text documents by issuing a series of short SAX character events. It is therefore possible to generate "infinitely" long binary and text documents with a constant amount of memory, provided both the sender and the receiver of the document are able to perform streaming. This is the case, for example, with the URL generator and the HTTP serializer.
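The idea of streaming a binary document can be sketched in Python as follows. This is a simplified model, not Orbeon Forms code: each string yielded by the hypothetical base64_chunks generator stands in for one short character event, and the hypothetical decode_chunks consumer decodes the events as they arrive. Memory use on both sides is bounded by the chunk size rather than by the document size.

```python
import base64
from typing import BinaryIO, Iterable, Iterator


def base64_chunks(stream: BinaryIO, chunk_size: int = 3072) -> Iterator[str]:
    """Yield Base64 text for a byte stream, one short piece at a time.

    Only multiples of 3 bytes are encoded before the end of the stream, so
    padding appears solely in the last piece and the concatenation of all
    pieces is the Base64 encoding of the whole stream.
    """
    pending = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        pending += block
        usable = len(pending) - len(pending) % 3
        if usable:
            yield base64.b64encode(pending[:usable]).decode("ascii")
            pending = pending[usable:]
    if pending:
        # final 1 or 2 leftover bytes, encoded with padding
        yield base64.b64encode(pending).decode("ascii")


def decode_chunks(chunks: Iterable[str], sink: BinaryIO) -> int:
    """Decode Base64 pieces as they arrive and write the bytes to sink.

    Pieces may be split at arbitrary positions; characters are buffered
    until a full group of 4 is available. Returns the number of bytes written.
    """
    pending = ""
    total = 0
    for chunk in chunks:
        pending += "".join(chunk.split())  # ignore any whitespace
        usable = len(pending) - len(pending) % 4
        if usable:
            total += sink.write(base64.b64decode(pending[:usable]))
            pending = pending[usable:]
    if pending:
        raise ValueError("truncated Base64 data")
    return total
```

A producer built this way never needs to hold the complete document: it can read from a file or network stream of any length, and the consumer can write the decoded bytes to its output as each piece arrives.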