This processor is a contribution from TAO Consulting Pte Ltd.
PDF Extraction Processor
- 1. Rationale
- 2. Usage
- 2.1. Config Input
- 2.2. Data Input
- 2.3. Data Output
1. Rationale
The PDF extraction processor takes a PDF file as input and extracts meta information and/or text from it. The extracted structure can be used by a full-text search engine or a content management system to identify the exact location of content items. The processor makes use of the PDFBox library (PDFBox on SourceForge).
The processor supports five modes of operation:
- Mode one: only the metadata and the bookmarks (title and page number) are extracted, without any text.
- Mode two: metadata, bookmarks and text are extracted.
- Mode three: metadata and text, broken down into pages, are extracted.
- Mode four: only the metadata (if any) is extracted.
- Mode five: an extraction of bookmarks with text is attempted first. If there are no bookmarks, the processor falls back to extraction by page.
If the file contains errors, the operation might not complete. Most errors are captured and lead to the insertion of an <error> tag. If the input file is fundamentally broken, no content is extracted; the processor still returns an empty <PDFDocument /> element.
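The error behaviour described above can be checked on the consuming side. The following is a minimal Python sketch, not part of the processor itself. It assumes the output carries no XML namespace and that <error> elements may appear at any depth; the helper name `check_extraction` and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_extraction(xml_text: str) -> dict:
    """Report whether the processor output is empty and list captured errors."""
    root = ET.fromstring(xml_text)
    # Collect the text of every <error> element, wherever it sits in the tree.
    errors = [(e.text or "").strip() for e in root.iter("error")]
    # An empty <PDFDocument /> signals a fundamentally broken input file.
    is_empty = (
        root.tag == "PDFDocument"
        and len(root) == 0
        and not (root.text or "").strip()
    )
    return {"empty": is_empty, "errors": errors}


# Hypothetical sample outputs, for illustration only.
broken = "<PDFDocument />"
partial = "<PDFDocument><PDFMetadata/><error>Could not read page 3</error></PDFDocument>"

print(check_extraction(broken))
print(check_extraction(partial))
```

A caller could treat `empty` as a hard failure and log the entries in `errors` as warnings, since an output containing <error> tags may still hold usable content.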
2. Usage
Processor Name | tao:from-pdf-converter |
---|---|
config input | Definition of the scope of extraction. |
data input | The PDF document in Base64 encoding |
data output | The XML Structure extracted from the PDF |
2.1. Config Input
The configuration input selects the extraction mode. Possible keywords are: bookmarks, bookmarksonly, bookmarkspages, meta or pages. The keyword determines what is extracted. "bookmarkspages" attempts to extract bookmarks with their enclosed text; if the PDF doesn't contain bookmarks, the processor falls back to extracting text by page.
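A caller that builds the config input dynamically may want to validate the keyword first. The sketch below is a hypothetical Python helper, not part of the processor. The five keywords come from the text above; the pairing of each keyword with a mode description is an assumption inferred from the keyword names, as is the exact, case-sensitive match.

```python
# Keyword -> what is extracted. The pairing is an assumption inferred
# from the keyword names; only "bookmarkspages" is described explicitly.
EXTRACTION_MODES = {
    "bookmarksonly": "metadata and bookmarks (title and page number), no text",
    "bookmarks": "metadata, bookmarks and text",
    "pages": "metadata and text broken down into pages",
    "meta": "metadata only",
    "bookmarkspages": "bookmarks with text, falling back to pages if there are no bookmarks",
}


def describe_mode(keyword: str) -> str:
    """Return a description of the mode, or raise ValueError for an unknown keyword."""
    if keyword not in EXTRACTION_MODES:
        raise ValueError(
            f"unknown extraction mode {keyword!r}; "
            f"expected one of {sorted(EXTRACTION_MODES)}"
        )
    return EXTRACTION_MODES[keyword]


print(describe_mode("bookmarkspages"))
```

Rejecting unknown keywords before the pipeline runs keeps a typo in the config from surfacing later as an unexpected or empty result.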
2.2. Data Input
The data input must contain the PDF converted to Base64. It can come from the URL Generator or the XForms upload control. The Base64 encoding must comply with the binary document format.
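When the PDF comes from neither of these sources, it has to be encoded by hand. The following Python sketch shows only the Base64 step, using the standard `base64` module. The helper name `encode_pdf` is hypothetical, and the wrapper element required by the binary document format is deliberately not shown because it is not specified here.

```python
import base64


def encode_pdf(data: bytes) -> str:
    """Return the Base64 text for a PDF given as raw bytes."""
    # Every PDF file begins with the "%PDF-" header; reject anything else early.
    if not data.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF (missing %PDF- header)")
    return base64.b64encode(data).decode("ascii")


# Tiny stand-in for real file content, for illustration only.
sample = b"%PDF-1.4\n%%EOF\n"
print(encode_pdf(sample))
```

For a file on disk, the bytes can be read with `pathlib.Path("document.pdf").read_bytes()` (the file name is a placeholder) and passed to `encode_pdf`.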
2.3. Data Output
The data output is an XML structure with all PDF-specific information. It starts with a PDFDocument element, followed by the PDFMetadata element, which contains PDF metadata according to the Adobe PDF specification. This is then followed by either Bookmark or Page elements.
For example, the following element could be generated: