Orbeon Forms User Guide

PDF Extraction Processor

1. Rationale

The PDF extraction processor takes a PDF file as input and extracts meta information and/or text from it. The extracted structure can be used by a full-text search engine or a content management system to identify the exact location of content items. The processor uses the PDFBox library (PDFBox on SourceForge).

The processor supports five modes of operation. In mode one, only the metadata and the bookmarks (title and page number, no text) are extracted. In mode two, metadata, bookmarks, and text are extracted. In mode three, metadata and text broken down by page are extracted. In mode four, only the metadata (if any) is extracted. In mode five, the processor first attempts to extract bookmarks with text; if the PDF has no bookmarks, it falls back to extraction by page.

If the file contains errors, the operation might not complete. Most errors are caught and lead to the insertion of an <error> tag in the output. If the input file is fundamentally broken, no extracted content is output (however, the processor still returns an empty <PDFDocument /> element).
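The three outcomes described above (normal output, output containing <error> tags, and an empty <PDFDocument /> element) can be told apart by a consumer of the data output. The following is a minimal Python sketch, not part of the processor. It assumes that <error> elements may appear anywhere below the root and carry their message as text; the guide does not specify either point. The function name summarize_extraction is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_extraction(xml_text):
    """Classify processor output as 'empty', 'partial', or 'ok'.

    Returns a (status, error_messages) tuple.
    """
    root = ET.fromstring(xml_text)
    # Assumption: <error> elements can occur at any depth and hold text.
    errors = [(e.text or "").strip() for e in root.iter("error")]
    # An empty <PDFDocument /> has no children, attributes, or text.
    is_empty = (
        len(root) == 0
        and not root.attrib
        and not (root.text or "").strip()
    )
    if is_empty:
        return "empty", errors
    if errors:
        return "partial", errors
    return "ok", errors
```

A caller might skip indexing on "empty" and log the returned messages on "partial".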

Note

This processor is a contribution from TAO Consulting Pte Ltd.

2. Usage

Processor name: tao:from-pdf-converter
config input: Definition of the scope of extraction.
data input: The PDF document in Base64 encoding.
data output: The XML structure extracted from the PDF.

2.1. Config Input

The configuration input selects the extraction mode. Possible keywords are: bookmarks, bookmarksonly, bookmarkspages, meta, or pages. The keyword determines what is extracted. "bookmarkspages" attempts to extract bookmarks with their enclosed text; if the PDF doesn't contain bookmarks, the processor falls back to extracting text by page.

<config><action>bookmarks</action></config>
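When the config document is generated by a script rather than written by hand, it helps to reject unknown keywords before the pipeline runs. The Python sketch below illustrates this. The five keywords come from this guide, but pairing each keyword with a mode description is an inference from the keyword names (only "bookmarkspages" is described explicitly). MODES and build_config are hypothetical names.

```python
# Keywords are taken from the guide. The description attached to each one
# is inferred from its name and is not stated by the guide itself.
MODES = {
    "bookmarksonly": "metadata and bookmarks (title and page number), no text",
    "bookmarks": "metadata, bookmarks, and text",
    "pages": "metadata and text broken down by page",
    "meta": "metadata only",
    "bookmarkspages": "bookmarks with text, falling back to pages",
}


def build_config(action):
    """Return the config document for one extraction keyword."""
    if action not in MODES:
        raise ValueError(
            f"unknown action {action!r}; expected one of {sorted(MODES)}"
        )
    return f"<config><action>{action}</action></config>"
```

For example, build_config("bookmarks") returns the same document as the config shown above.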

2.2. Data Input

The data input must contain the PDF converted to Base64. It can come from the URL generator or the XForms upload control. The Base64 encoding must comply with the binary document format.

<p:input name="data" href="#file"/>
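For testing outside a pipeline, a Base64 data document can also be produced by hand. The Python sketch below assumes that the binary document format is a <document> root element with xsi:type="xs:base64Binary" and a content-type attribute; that shape is an assumption here and should be checked against the Orbeon documentation. The function name to_binary_document is hypothetical.

```python
import base64


def to_binary_document(pdf_bytes, content_type="application/pdf"):
    """Wrap raw PDF bytes in a Base64 binary document.

    Assumption: the expected wrapper is a <document> element typed as
    xs:base64Binary, with the media type in a content-type attribute.
    """
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        '<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
        'xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        f'xsi:type="xs:base64Binary" content-type="{content_type}">'
        f"{payload}</document>"
    )
```

Standard Base64 output contains only letters, digits, "+", "/" and "=", so the payload needs no XML escaping.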

2.3. Data Output

The data output is an XML structure with all PDF-specific information. Its root is a PDFDocument element, which contains a PDFMetadata element holding the PDF metadata according to the Adobe PDF specification, followed by either Bookmark or Page elements.

For example, the following element could be generated:

<PDFDocument pages="32" author="Stephan H. Wissel" title="R6 Migration Report" subject="Recommendation for Migration">
    <PDFMetadata>... a lot of stuff here ...</PDFMetadata>
    <Bookmark level="1" page="2">
        <Title>Management Summary</Title>
    </Bookmark>
    <Bookmark level="1" page="3">
        <Title>Scope of work / Findings</Title>
        <Bookmark level="2" page="3">
            <Title>Scope of work</Title>
        </Bookmark>
    </Bookmark>
</PDFDocument>
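A search indexer or content management system typically needs the nested bookmarks as a flat list of locations. The Python sketch below walks the structure shown in the example and returns (level, page, title) tuples in document order. It relies only on the element and attribute names visible in that example; the sample replaces the elided PDFMetadata content with an empty element, and flatten_bookmarks is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Same structure as the example output, with PDFMetadata left empty.
SAMPLE = """
<PDFDocument pages="32" title="R6 Migration Report">
  <PDFMetadata/>
  <Bookmark level="1" page="2"><Title>Management Summary</Title></Bookmark>
  <Bookmark level="1" page="3">
    <Title>Scope of work / Findings</Title>
    <Bookmark level="2" page="3"><Title>Scope of work</Title></Bookmark>
  </Bookmark>
</PDFDocument>
"""


def flatten_bookmarks(root):
    """Return (level, page, title) for every Bookmark, in document order."""
    result = []

    def walk(node):
        # Only direct children, so each bookmark is visited exactly once.
        for bookmark in node.findall("Bookmark"):
            result.append((
                int(bookmark.get("level")),
                int(bookmark.get("page")),
                bookmark.findtext("Title", default=""),
            ))
            walk(bookmark)

    walk(root)
    return result


if __name__ == "__main__":
    for level, page, title in flatten_bookmarks(ET.fromstring(SAMPLE)):
        print("  " * (level - 1) + f"{title} (page {page})")
```

The recursion keeps parent entries ahead of their children, so the flat list preserves the outline order of the PDF.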