DigiCULT.Info
Using Spatial Knowledge to Classify Metadata
To date, the creation of metadata remains a bottleneck for many institutions collecting digital material. It is, therefore, natural that research in this area is increasing. The automated extraction of technical metadata is achievable for many file formats, as this information is encoded in the way the format's specification lays down. Several tools are under development to cater for this task, for example, the National Library of New Zealand Preservation Metadata Extract Tool,[51] OCLC's PREMIS (PREservation Metadata: Implementation Strategies)[52] and the work of the Global Digital Format Registry.[53] While this cannot be guaranteed, it would not be surprising to see these tools achieve a high level of accuracy and reliability.
Extracting semantic metadata from the content of documents is a different matter. Existing file formats tend not to contain semantic markup and it remains a difficult task to distinguish between sections of a text-based document. Various techniques are under investigation. This article describes a technique for extracting semantic metadata from documents produced in the PostScript (PS) format.
PostScript is both a simple programming language and a page description format, designed to allow powerful graphics capabilities.[54] When text is stored in a PostScript file, it tends to be stored as plain text surrounded by method calls initialising the appropriate display configuration. This includes the xy page location where printing should begin, the font name, font size, and font widths. Printing a PostScript document requires an interpreter, often GhostScript (http://www.cs.wisc.edu/~ghost/), to translate this code into printable commands.
Several methods have been proposed to extract metadata from a document. A popular method of classifying documents is to employ statistical frequencies of words to categorise elements; however, this method is more appropriate for document summarisation tasks. An alternative is to use the spatial knowledge we have of documents to classify certain elements; for example, a title generally appears at the top of a page and is in a larger font size. This idea has been used to implement a metadata generation system at the US National Library of Medicine, and is discussed at length in 'Knowledge Based Metadata Extraction from PostScript Files' by G. Giuffrida et al.[55]
The technique requires extracting text from a document and associating information about the font, metrics, and xy location with each line. A rule set can then be applied to these strings (a string is a complete horizontal row of text containing no line breaks) to produce increasingly accurate candidates for a particular element.
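
To make the idea concrete, the following is a minimal Python sketch of a string annotated with layout information, and of a rule set that narrows the candidates for an element one rule at a time. It is an illustration only, not the implementation described by Giuffrida et al. The names `TextString`, `apply_rules`, `near_top_of_page` and `larger_than_typical` are hypothetical, as are the 792-point page height and the "top quarter" threshold.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, List


@dataclass
class TextString:
    """One complete horizontal row of text plus its layout information."""
    text: str
    page: int         # 1-based page number
    x: float          # horizontal start position, in points
    y: float          # vertical position, in points (origin at bottom-left)
    font_name: str
    font_size: float


# A rule takes the current candidates and returns a narrower list.
Rule = Callable[[List[TextString]], List[TextString]]


def apply_rules(strings: List[TextString], rules: List[Rule]) -> List[TextString]:
    """Apply each rule in turn, keeping the narrowed candidate list."""
    candidates = list(strings)
    for rule in rules:
        narrowed = rule(candidates)
        # If a rule eliminates everything, ignore it rather than lose
        # all candidates (an assumption made for this sketch).
        if narrowed:
            candidates = narrowed
    return candidates


def near_top_of_page(candidates: List[TextString],
                     page_height: float = 792.0) -> List[TextString]:
    """Keep strings in the top quarter of the page (assumed threshold)."""
    return [s for s in candidates if s.y >= 0.75 * page_height]


def larger_than_typical(candidates: List[TextString]) -> List[TextString]:
    """Keep strings set in a larger font than the median candidate."""
    typical = median(s.font_size for s in candidates)
    return [s for s in candidates if s.font_size > typical]
```

Calling `apply_rules(strings, [near_top_of_page, larger_than_typical])` returns the strings that survive both rules. Further rules can be appended to the list without changing the driver function.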
Open source PostScript-to-text converters already exist, which redirect text from the printer to a text file. Pstotext (http://www.cs.wisc.edu/~ghost/doc/pstotext.htm) is arguably the most stable, although prescript (http://www.nzdl.org/html/prescript.html) is also of note. To output text, each of these contains small PostScript programs that override the output methods of a GhostScript processor and redirect the text. The pstotext program also utilises font and metrics information applied to each fragment of text to ensure the document is reconstructed correctly.
It is possible to extend this program to output text and its additional information at a string level. It is then possible to apply rules to a set of strings, determining additional implicit properties. Determining and refining such a rule collection allows the classification of items such as title, author, date of publication, abstract and table of contents. Such a rule set may require a document to have a certain layout to be recognised, but as the rule set can be extended, additional layouts can be interpreted.
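
As an illustration of what output "at a string level" might involve, the Python sketch below merges positioned text fragments into strings by grouping fragments on the same page that share (approximately) the same vertical position. This is a sketch under stated assumptions, not the behaviour of pstotext itself: the `Fragment` and `Line` records, the `group_into_strings` function and the 2-point tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A piece of text as positioned by the page description."""
    text: str
    page: int
    x: float
    y: float          # PostScript's default origin is bottom-left, y grows upward
    font_size: float


@dataclass
class Line:
    """A complete horizontal row of text (a 'string' in the article's terms)."""
    text: str
    page: int
    x: float
    y: float
    font_size: float


def _merge(frags: List[Fragment]) -> Line:
    """Join fragments left to right into a single string."""
    frags = sorted(frags, key=lambda f: f.x)
    return Line(
        text=" ".join(f.text for f in frags),
        page=frags[0].page,
        x=frags[0].x,
        y=frags[0].y,
        font_size=max(f.font_size for f in frags),
    )


def group_into_strings(fragments: List[Fragment],
                       y_tolerance: float = 2.0) -> List[Line]:
    """Group fragments on the same page and row into strings."""
    lines: List[Line] = []
    current: List[Fragment] = []
    # Read each page top to bottom (descending y), then left to right.
    ordered = sorted(fragments, key=lambda f: (f.page, -f.y, f.x))
    for frag in ordered:
        starts_new_row = current and (
            frag.page != current[0].page
            or abs(frag.y - current[0].y) > y_tolerance
        )
        if starts_new_row:
            lines.append(_merge(current))
            current = []
        current.append(frag)
    if current:
        lines.append(_merge(current))
    return lines
```

The resulting `Line` records carry the position and font size that a rule set needs. A real converter would also have to deal with multi-column layouts and word spacing, which this sketch ignores.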
A simplified rule set for title identification may find that the title:
1. is generally found on the first page;
2. precedes the abstract or introduction;
3. contains the largest text on the page.
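
The three rules above can be sketched in Python as follows. This is a simplified illustration, not a tested extraction system: the `TextString` record and `find_title` function are hypothetical names, the heading pattern is an assumption about how an abstract or introduction is labelled, and "precedes" is interpreted as "appears higher on the first page" (PostScript's default y axis grows upward).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextString:
    """A row of text with the layout facts the rules need."""
    text: str
    page: int
    y: float          # vertical position; larger means higher on the page
    font_size: float


# Assumed form of a section heading such as "Abstract" or "1. Introduction".
HEADING = re.compile(r"^(\d+\.?\s*)?(abstract|introduction)[\s.:]*$")


def find_title(strings: List[TextString]) -> Optional[str]:
    # Rule 1: the title is generally found on the first page.
    candidates = [s for s in strings if s.page == 1]
    if not candidates:
        return None

    # Rule 2: the title precedes the abstract or introduction.
    headings = [s for s in candidates if HEADING.match(s.text.strip().lower())]
    if headings:
        boundary = max(h.y for h in headings)   # topmost heading
        above = [s for s in candidates if s.y > boundary]
        if above:                               # otherwise keep rule 1's result
            candidates = above

    # Rule 3: the title contains the largest text among the candidates.
    largest = max(s.font_size for s in candidates)
    title_rows = [s for s in candidates if s.font_size == largest]

    # A title may span several rows; join them top to bottom.
    title_rows.sort(key=lambda s: -s.y)
    return " ".join(s.text.strip() for s in title_rows)
```

Applying rule 2 before rule 3 means that a large heading further down the first page does not displace the title. Documents whose layout breaks these assumptions would need additional rules.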
Attractively, this rule identification system can be applied to multiple file formats, assuming the correct information can be output. For example, the Portable Document Format (PDF) is built upon the PostScript format and contains similar font and metrics information.[56] Additionally, this work can be combined with alternative metadata generation and document summarisation tools, allowing the time and technical requirements of ingest to be reduced and streamlined. As a collaborative technique, this is very promising for digital library systems and has potentially huge benefits for digital collections of all kinds.
Adam Rusbridge, ERPANET Technical Analyst
[51] This tool has been submitted for the Digital Preservation Award 2004, a new award described in 'Recognising Advances in Digital Preservation', also in this issue. For more information, see http://www.natlib.govt.nz/files/Project%20Description_v3-final.pdf.
[52] http://www.oclc.org/research/projects/pmwg/
[53] http://hul.harvard.edu/gdfr/
[54] For PostScript specifications, see http://partners.adobe.com/asn/tech/ps/specifications.jsp.
[55] Article full text available at http://citeseer.ist.psu.edu/385845.html.
[56] For PDF specifications, see http://partners.adobe.com/asn/tech/pdf/specifications.jsp.