Note: The oxf: protocol works only with resource managers that allow access to the actual path of the file. These include the Filesystem and WebApp resource managers.
Directory Scanner
- 1. Introduction
- 2. Inputs and Outputs
- 3. Configuration
- 4. Output Format
- 4.1. Basic Output
- 4.2. Image Metadata
- 4.3. Other Metadata
- 5. Ant Patterns
1. Introduction
The purpose of the Directory Scanner processor is to analyse a directory structure in a filesystem and to produce an XML document containing metadata about the files, such as name and size. It is possible to specify which files and directories to include and exclude in the scanning process. The Directory Scanner is also able to optionally retrieve image metadata.
2. Inputs and Outputs
Type | Name | Purpose | Mandatory |
---|---|---|---|
Input | config | Configuration | Yes |
Output | data | Result XML data | Yes |
The Directory Scanner is typically called this way from XPL pipelines:
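A minimal sketch of such a call follows. The processor name oxf:directory-scanner, the namespace bindings, and the config root element are assumptions not stated on this page; the base-directory value and the output id are placeholders.

```xml
<!-- Sketch only: the processor name, the namespace bindings and the
     "config" root element are assumptions. -->
<p:config xmlns:p="http://www.orbeon.com/oxf/pipeline"
          xmlns:oxf="http://www.orbeon.com/oxf/processors">
    <p:processor name="oxf:directory-scanner">
        <!-- Mandatory "config" input, given inline here -->
        <p:input name="config">
            <config>
                <base-directory>file:/path/to/web</base-directory>
                <include>**/*.jpg</include>
            </config>
        </p:input>
        <!-- Mandatory "data" output; the id is a hypothetical name -->
        <p:output name="data" id="directory-scan"/>
    </p:processor>
</p:config>
```

A later processor in the same pipeline would then read the result by referring to the output id.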
3. Configuration
The config input has the following format:
Element | Purpose | Format | Default |
---|---|---|---|
base-directory | Directory under which files and directories are scanned, referred to below as the search directory. | A URL. See the note on the oxf: protocol at the top of this page. | None. |
include | Specifies which files are included. | Apache Ant pattern. | None. |
exclude | Specifies which files are excluded. | Apache Ant pattern. | None. |
case-sensitive | Whether include and exclude patterns are case-sensitive. | true or false. | true |
default-excludes | Whether a set of default exclusion rules must be automatically loaded. The list is as follows: | true or false. | false |
image-metadata/basic-info | Whether basic image metadata must be extracted. | true or false. | false |
image-metadata/exif-info | Whether Exif image metadata must be extracted. | true or false. | false |
image-metadata/iptc-info | Whether IPTC image metadata must be extracted. | true or false. | false |
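The elements in the table above might be combined as in the following sketch. The config root element name and the nesting of basic-info, exif-info and iptc-info under a single image-metadata element are assumptions inferred from the path notation in the table; all values are illustrative.

```xml
<!-- Sketch only: the "config" root element name and all values are assumptions -->
<config>
    <!-- Search directory -->
    <base-directory>file:/path/to/web</base-directory>
    <!-- Apache Ant patterns selecting and rejecting files -->
    <include>**/*.jpg</include>
    <exclude>**/thumbnails/**</exclude>
    <!-- Overrides the default of true -->
    <case-sensitive>false</case-sensitive>
    <!-- Overrides the default of false -->
    <default-excludes>true</default-excludes>
    <!-- Each kind of image metadata is off unless set to true -->
    <image-metadata>
        <basic-info>true</basic-info>
        <exif-info>true</exif-info>
        <iptc-info>false</iptc-info>
    </image-metadata>
</config>
```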
4. Output Format
4.1. Basic Output
The output format starts with a root directory element with name and path attributes. The name attribute specifies the name of the search directory, e.g. web. The path attribute specifies an absolute path to that directory.
The root element then contains a hierarchical structure of the directory and file elements found. For example:
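A sketch of such a result, using the element and attribute names documented in this section. All names, sizes and timestamps are invented, and the exact lexical form of the dates (for instance, whether a time zone is present) may differ in real output.

```xml
<!-- Invented values; the root path is absolute, child paths are relative -->
<directory name="web" path="/path/to/web">
    <file last-modified-ms="1104580800000"
          last-modified-date="2005-01-01T12:00:00.000Z"
          size="2048" path="index.html" name="index.html"/>
    <directory name="images" path="images">
        <!-- path is shown relative to the parent directory element,
             following the attribute description for file elements -->
        <file last-modified-ms="1104584400000"
              last-modified-date="2005-01-01T13:00:00.000Z"
              size="10240" path="logo.png" name="logo.png"/>
    </directory>
</directory>
```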
directory elements contain basic information about a matched directory:
Name | Value |
---|---|
path | Path to the directory, relative to the parent directory. Includes the current directory name. |
name | Local directory name. |
The path attribute on the root element is an absolute path from a filesystem root. The path attributes on child directory elements are relative to their parent directory element.
file elements contain basic information about a matched file:
Name | Value |
---|---|
last-modified-ms | Timestamp of last modification in milliseconds. |
last-modified-date | Timestamp of last modification in XML xs:dateTime format. |
size | Size of the file in bytes. |
path | Path to the file, relative to the parent directory. Includes the file name. |
name | Local file name. |
4.2. Image Metadata
When the configuration's image-metadata element is specified, metadata about images is extracted.
Images are identified by reading the beginning of the files. This means that extracting image metadata is usually more expensive in time than just producing regular file metadata.
When an image is identified, an image-metadata element is available under the corresponding file element.
When image-metadata/basic-info is true in the configuration, a basic-info element is created under image-metadata:
Element Name | Element Value |
---|---|
content-type | Media type of the file: image/jpeg, image/gif, image/png. Other image/* values may be produced for other image formats. |
width | Image width, if found. |
height | Image height, if found. |
comment | Image comment, if found (JPEG only). |
When image-metadata/exif-info is true in the configuration, zero or more exif-info elements are created under image-metadata. Each element has a name attribute identifying the category of Exif information. Basic Exif information has the name Exif. Other names may include Canon Makernote for a Canon camera, Interoperability, etc. Each exif-info element contains zero or more param elements, with the following sub-elements:
Element Name | Element Value |
---|---|
id | The Exif parameter id. For example, 271 denotes the make of the camera. |
name | A default English name for the given parameter id, when known, for example Make. |
value | The value of the parameter, for example Canon. |
This is an example of a file element with image metadata:
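A sketch of such an element, built from the element names and sample values given in the tables above (271, Make, Canon). The file attributes and image dimensions are invented, and the use of an attribute called name on exif-info is an assumption based on the description above.

```xml
<!-- Invented file attributes and dimensions; sketch only -->
<file last-modified-ms="1104580800000"
      last-modified-date="2005-01-01T12:00:00.000Z"
      size="524288" path="photo.jpg" name="photo.jpg">
    <image-metadata>
        <!-- Present when image-metadata/basic-info is true -->
        <basic-info>
            <content-type>image/jpeg</content-type>
            <width>1600</width>
            <height>1200</height>
        </basic-info>
        <!-- Present when image-metadata/exif-info is true; one element
             per category of Exif information -->
        <exif-info name="Exif">
            <param>
                <id>271</id>
                <name>Make</name>
                <value>Canon</value>
            </param>
        </exif-info>
    </image-metadata>
</file>
```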
When image-metadata/iptc-info is true in the configuration, zero or more iptc-info elements are created under image-metadata. Each element has a name attribute identifying the category of IPTC information. The child elements of iptc-info are the same as for exif-info.
4.3. Other Metadata
The Directory Scanner does not currently provide metadata about other kinds of files. The processor could be extended to support more metadata, both for further image formats and for other file formats such as sound files.
5. Ant Patterns
This section of the documentation is reproduced from a section of the Apache Ant Manual, with minor adjustments.
Patterns are used for the inclusion and exclusion of files. These patterns look very much like the patterns used in DOS and UNIX:
'*' matches zero or more characters, '?' matches one character.
In general, patterns are considered relative paths, relative to a task-dependent base directory (the dir attribute in the case of <fileset>). Only files found below that base directory are considered. So while a pattern like ../foo.java is possible, it will not match anything when applied since the base directory's parent is never scanned for files.
Examples:
*.java matches .java, x.java and FooBar.java, but not FooBar.xml (does not end with .java).
?.java matches x.java, A.java, but not .java or xyz.java (both don't have one character before .java).
Combinations of *'s and ?'s are allowed.
Matching is done per-directory. This means that first the first directory in the pattern is matched against the first directory in the path to match. Then the second directory is matched, and so on. For example, when we have the pattern /?abc/*/*.java and the path /xabc/foobar/test.java, the first ?abc is matched with xabc, then * is matched with foobar, and finally *.java is matched with test.java. They all match, so the path matches the pattern.
To make things a bit more flexible, we add one extra feature, which makes it possible to match multiple directory levels. This can be used to match a complete directory tree, or a file anywhere in the directory tree. To do this, ** must be used as the name of a directory. When ** is used as the name of a directory in the pattern, it matches zero or more directories. For example: /test/** matches all files/directories under /test/, such as /test/x.java, or /test/foo/bar/xyz.html, but not /xyz.xml.
There is one "shorthand": if a pattern ends with / or \, then ** is appended. For example, mypackage/test/ is interpreted as if it were mypackage/test/**.
Example patterns:
Pattern | Description |
---|---|
**/CVS/* | Matches all files in CVS directories that can be located anywhere in the directory tree. Matches: CVS/Repository, org/apache/CVS/Entries, org/apache/jakarta/tools/ant/CVS/Entries. But not: org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match). |
org/apache/jakarta/** | Matches all files in the org/apache/jakarta directory tree. Matches: org/apache/jakarta/tools/ant/docs/index.html, org/apache/jakarta/test.xml. But not: org/apache/xyz.java (jakarta/ part is missing). |
org/apache/**/CVS/* | Matches all files in CVS directories that are located anywhere in the directory tree under org/apache. Matches: org/apache/CVS/Entries, org/apache/jakarta/tools/ant/CVS/Entries. But not: org/apache/CVS/foo/bar/Entries (foo/bar/ part does not match). |
**/test/** | Matches all files that have a test element in their path, including test as a filename. |
When these patterns are used in inclusion and exclusion, you have a powerful way to select just the files you want.
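As an illustration, two of the example patterns above could be combined in a Directory Scanner configuration. This is a sketch under assumptions: the config root element name and the base-directory value are hypothetical, and it assumes that exclusions take precedence over inclusions, as they do in Ant filesets.

```xml
<!-- Sketch only: root element name and base directory are assumptions -->
<config>
    <base-directory>file:/path/to/sources</base-directory>
    <!-- Select everything in the org/apache/jakarta directory tree -->
    <include>org/apache/jakarta/**</include>
    <!-- Then drop files located directly in any CVS directory -->
    <exclude>**/CVS/*</exclude>
</config>
```

Under these assumptions, org/apache/jakarta/test.xml would be reported, while org/apache/jakarta/tools/ant/CVS/Entries would not.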