The URL must point to a well-formed XML document. If it doesn't, an exception will be raised.
URL Generator
- 1. Introduction
- 2. Content Type
- 3. XML Mode
- 4. HTML Mode
- 5. Text Mode
- 6. Binary Mode
- 7. Character Encoding
- 8. HTTP Headers
- 9. Cache Control
- 10. Relative URLs
1. Introduction
Generators are a special category of processors that have no XML data inputs, only outputs. They are generally used at the top of an XML pipeline to generate XML data from a Java object or other non-XML source.
The URL generator fetches a document from a URL and produces an XML output document.
Common protocols such as http:, ftp:, and file: are supported, as well as the Orbeon Forms resource protocol (oxf:). See Resource Managers for more information about the oxf: protocol.
2. Content Type
The URL generator operates in several modes depending on the content type of the source document. The content type is determined according to the following priorities:
- Use the content type in the content-type element of the configuration if force-content-type is set to true.
- Use the content type set by the connection (for example, the content type sent with the document by an HTTP server), if any. Note that when using the oxf: or file: protocol, the connection content type is never available. When using the http: protocol, the connection content type may or may not be available depending on the configuration of the HTTP server.
- Use the content type in the content-type element of the configuration, if specified.
- Use application/xml.
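As an illustration of the first priority, the following configuration sketch forces the generator to treat a document as plain text regardless of what the server reports. The url element name and the example address are assumptions made for illustration; content-type and force-content-type are the elements described above.

```xml
<!-- Sketch only: the url element name and the address are assumed -->
<config>
    <url>http://www.example.org/data/report</url>
    <!-- Content type to apply to the fetched document -->
    <content-type>text/plain</content-type>
    <!-- true: the configured content type wins over the connection's -->
    <force-content-type>true</force-content-type>
</config>
```

Without force-content-type, the same content-type element would only be used when the connection does not supply a content type.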
3. XML Mode
The XML mode is selected when the content type, as determined by the selection algorithm above, is text/xml or application/xml, or ends with +xml. The generator fetches the specified URL and parses the XML document. If the validating option is set to true, a validating parser is used; otherwise a non-validating parser is used. A validating parser makes it possible to validate a document against a DTD. In addition, the URL generator can handle XInclude inclusions during parsing, and does so by default. This can be disabled by setting the handle-xinclude option to false.
Example:
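A minimal configuration sketch for XML mode could look like this. The url element name and the resource path are assumptions for illustration; validating and handle-xinclude are the options described above.

```xml
<!-- Sketch only: the url element name and the path are assumed -->
<config>
    <url>oxf:/documents/claim.xml</url>
    <!-- No content-type is given and oxf: provides none,
         so the default application/xml selects XML mode -->
    <!-- Use a validating parser (DTD validation) -->
    <validating>true</validating>
    <!-- Turn off XInclude processing, which is on by default -->
    <handle-xinclude>false</handle-xinclude>
</config>
```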
4. HTML Mode
The HTML mode is selected when the content type is text/html according to the selection algorithm above. In this mode, the URL generator uses HTML Tidy to transform HTML into XML. This feature is useful to later extract information from HTML using XPath.
Examples:
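Two configurations of this kind might look like the following sketch. The pipeline wrapper, the url element name, the addresses, and the output ids are assumptions for illustration; content-type, tidy-options, and quiet are the names used in this section.

```xml
<p:config xmlns:p="http://www.orbeon.com/oxf/pipeline"
          xmlns:oxf="http://www.orbeon.com/oxf/processors">

    <!-- Remote page: the HTTP server is expected to report text/html -->
    <p:processor name="oxf:url-generator">
        <p:input name="config">
            <config>
                <url>http://www.example.org/index.html</url>
                <tidy-options>
                    <quiet>true</quiet>
                </tidy-options>
            </config>
        </p:input>
        <p:output name="data" id="remote-page"/>
    </p:processor>

    <!-- Local resource: oxf: never reports a content type,
         so text/html is supplied in the configuration -->
    <p:processor name="oxf:url-generator">
        <p:input name="config">
            <config>
                <url>oxf:/pages/index.html</url>
                <content-type>text/html</content-type>
                <tidy-options>
                    <quiet>true</quiet>
                </tidy-options>
            </config>
        </p:input>
        <p:output name="data" id="local-page"/>
    </p:processor>

</p:config>
```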
The <tidy-options> part of the configuration in the two examples above is optional. However, by default quiet is set to false, which causes HTML Tidy to output messages to the console when it finds invalid HTML. To prevent this, add a <tidy-options> section to the configuration with quiet set to true.
Even if HTML Tidy has some tolerance for malformed HTML, you should use well-formed HTML whenever possible.
5. Text Mode
The text mode is selected when the content type, according to the selection algorithm above, starts with text/ and is different from text/html or text/xml, for example text/plain. In this mode, the URL generator reads the input as a text file and produces an XML document containing the text read.
Example:
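A configuration sketch for text mode might look like this. The url element name and the resource path are assumptions for illustration. Because the oxf: protocol never supplies a content type, text/plain is given in the configuration.

```xml
<!-- Sketch only: the url element name and the path are assumed -->
<config>
    <url>oxf:/documents/notes.txt</url>
    <!-- Starts with text/ and is neither text/html nor text/xml -->
    <content-type>text/plain</content-type>
</config>
```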
Assume the input document contains the following text:
This is line one of the input document!
This is line two of the input document!
This is line three of the input document!
The resulting document consists of a document root element containing the text, according to the text document format. An xsi:type attribute is also present, as well as a content-type attribute:
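For the three-line input shown earlier, the output might look like the following sketch. The xs:string value of xsi:type and the exact whitespace handling are assumptions; the text document format is the authority on both.

```xml
<!-- Sketch only: the xsi:type value is assumed, not taken from this page -->
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xsi:type="xs:string"
          content-type="text/plain">This is line one of the input document!
This is line two of the input document!
This is line three of the input document!</document>
```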
The URL generator performs streaming. It generates a stream of short character SAX events. It is therefore possible to generate an "infinitely" long document with a constant amount of memory, assuming the generator is connected to other processors that do not require storing the entire stream of data in memory, for example the SQL processor (with an appropriate configuration to stream BLOBs), or the HTTP serializer.
6. Binary Mode
The binary mode is selected when the content type is neither one of the XML content types nor one of the text/* content types. In this mode, the URL generator uses a Base64 encoding to transform binary content into XML according to the binary document format. For example:
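A configuration of this kind might look like the following sketch. The url element name and the resource path are assumptions for illustration. The content type is given explicitly because the oxf: protocol never supplies one, and the default of application/xml would otherwise select XML mode.

```xml
<!-- Sketch only: the url element name and the path are assumed -->
<config>
    <url>oxf:/images/logo.gif</url>
    <!-- Neither an XML nor a text/* content type: binary mode -->
    <content-type>image/gif</content-type>
</config>
```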
The resulting document consists of a document root node containing character data encoded with Base64. An xsi:type attribute is also present, as well as a content-type attribute, if found:
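As a sketch of the shape of such a document, the following shows the 13 bytes of the ASCII string "Hello, World!" encoded with Base64. The xs:base64Binary value of xsi:type and the application/octet-stream content type are assumptions for illustration; the binary document format is the authority.

```xml
<!-- Sketch only: the xsi:type value and the content type are assumed -->
<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xsi:type="xs:base64Binary"
          content-type="application/octet-stream">SGVsbG8sIFdvcmxkIQ==</document>
```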
The URL generator performs streaming. It generates a stream of short character SAX events. It is therefore possible to generate an "infinitely" long document with a constant amount of memory, assuming the generator is connected to other processors that do not require storing the entire stream of data in memory, for example the SQL processor (with an appropriate configuration to stream BLOBs), or the HTTP serializer.
7. Character Encoding
For text and XML, the character encoding is determined as follows:
- Use the encoding in the encoding element of the configuration if force-encoding is set to true.
- Use the encoding set by the connection (for example, the encoding sent with the document by an HTTP server), if any, unless ignore-connection-encoding is set to true (for XML documents, precedence is given to the connection encoding as per RFC 3023). Note that when using the oxf: or file: protocol, the connection encoding is never available. When using the http: protocol, the connection encoding may or may not be available depending on the configuration of the HTTP server. The encoding is specified along with the content type in the content-type header, for example: content-type: text/html; charset=iso-8859-1.
- Use the encoding in the encoding element of the configuration, if specified.
- For XML, the character encoding is determined automatically by the XML parser.
- For text, including HTML: use the default of iso-8859
When reading XML documents, the preferred method of determining the character encoding is to let either the connection or the XML parser auto-detect the encoding. In some instances, it may be necessary to override the encoding. For this purpose, the force-encoding and encoding elements can be used to override this default behavior, for example:
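An override of this kind might be configured as in the following sketch. The url element name and the address are assumptions for illustration; encoding and force-encoding are the elements described above.

```xml
<!-- Sketch only: the url element name and the address are assumed -->
<config>
    <url>http://www.example.org/data/feed.xml</url>
    <content-type>application/xml</content-type>
    <!-- Encoding to use instead of the connection's or the parser's -->
    <encoding>iso-8859-1</encoding>
    <!-- true: the configured encoding takes precedence over all others -->
    <force-encoding>true</force-encoding>
</config>
```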
This use should be reserved for cases where it is known that a document specifies an incorrect encoding and it is not possible to modify the document.
HTML example:
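For HTML, a configuration sketch might look like this. The url element name and the resource path are assumptions for illustration. Since the oxf: protocol never supplies a content type or an encoding, both are taken from the configuration without needing to be forced.

```xml
<!-- Sketch only: the url element name and the path are assumed -->
<config>
    <url>oxf:/pages/index.html</url>
    <content-type>text/html</content-type>
    <!-- One of the two encodings supported for HTML documents -->
    <encoding>utf-8</encoding>
</config>
```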
Note that only the following encodings are supported for HTML documents:
- iso-8859-1
- utf-8
Also note that use of the HTML <meta> tag to specify the encoding from within an HTML document is not supported.
8. HTTP Headers
When retrieving a document from an HTTP server, you can optionally specify the headers sent to the server by adding one or more header elements, as illustrated in the example below:
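A sketch of such a configuration follows. The header element is named in this section, but its name and value child elements, the url element, the address, and the header values shown are assumptions for illustration.

```xml
<!-- Sketch only: the children of header, and the url element, are assumed -->
<config>
    <url>http://www.example.org/data/feed.xml</url>
    <!-- One header element per HTTP request header to send -->
    <header>
        <name>User-Agent</name>
        <value>ExampleClient/1.0</value>
    </header>
    <header>
        <name>Accept-Language</name>
        <value>en</value>
    </header>
</config>
```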
9. Cache Control
It is possible to configure whether the URL generator caches documents locally in the Orbeon Forms cache. By default, it does. To disable caching, use the cache-control/use-local-cache element, for example:
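Disabling the local cache might look like the following sketch. The url element name and the address are assumptions for illustration; the cache-control/use-local-cache nesting follows the element path given above.

```xml
<!-- Sketch only: the url element name and the address are assumed -->
<config>
    <url>http://www.example.org/data/feed.xml</url>
    <cache-control>
        <!-- false: always fetch the document, never consult the cache -->
        <use-local-cache>false</use-local-cache>
    </cache-control>
</config>
```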
Using the local cache causes the URL generator to first check whether the document is in the Orbeon Forms cache. If it is, its validity is checked with the protocol handler (looking at the last modified date for files, the last-modified header for http, etc.). If the cached document is valid, it is used. Otherwise, it is fetched and put in the cache.
When the local cache is disabled, the document is never revalidated and always fetched.
10. Relative URLs
URLs passed to the URL generator can be relative. For example, consider the following pipeline fragment declared in a file called oxf:/my-pipelines/backend/import.xpl:
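The fragment might look like the following sketch. The relative path is chosen to be consistent with the resolution described in this section: starting from the directory oxf:/my-pipelines/backend/, each ../ segment moves up one level. The url element name, the namespace declarations, and the output id are assumptions for illustration.

```xml
<!-- Sketch only: declared in oxf:/my-pipelines/backend/import.xpl -->
<p:processor name="oxf:url-generator"
             xmlns:p="http://www.orbeon.com/oxf/pipeline"
             xmlns:oxf="http://www.orbeon.com/oxf/processors">
    <p:input name="config">
        <config>
            <!-- Relative to the location of the pipeline file -->
            <url>../../documents/claim.xml</url>
        </config>
    </p:input>
    <p:output name="data" id="claim"/>
</p:processor>
```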
In this case, the URL resolves to oxf:/documents/claim.xml.