DigiCULT.Info 24
In a project of this size and scope, planning is everything. Background work started over five years ago on copyright permissions, experiments with microfilm duplication, software enhancements, clipping and categorisation rules, XML and DTD creation as well as the design and build of the delivery platform. The endless design meetings and conference calls, though tedious at the time, really underpinned the project and ensured its success. And from a personal point of view it has been great fun: I am now a good deal more knowledgeable about the intricacies of image processing and file organisation and have a head stuffed full of pirates, pickpockets and ladies' maids.
metadata are returned to the UK for
checking before being forwarded to our
US site for indexing and uploading to
our own servers.
The process may appear both very technical and very mechanical, but software can only do so much. Many months were spent manually checking nearly every image for text that was too faint, too damaged or too dark to OCR well. Perhaps I should also mention the detailed examination of hundreds of pages to establish rules for the segmentation of articles and the management of a growing list of criteria for article categorisation. We have seen eighteenth-century issues without page numbers, issue numbers that appear more than once, mastheads printed upside down and articles with continuations in earlier columns or pages. Where the scanners have been unable to extract any detail from the film pages we have gone back to hard copy, but in many instances the originals are even worse.
Figure 3: The article delivered is the second of the three results shown in Figure 2. The search terms are highlighted in colour in the metadata and the body of an article - especially useful in longer articles.
Figure 4: Page 6 of The Times of 8 June 1953. In the
'Browse by Date' search, mousing over a headline on the
right-hand side outlines the related article in the large
thumbnail image. Clicking on either the headline or
within the outlined area will deliver the article selected.
To find out more about The Times Digital Archive, visit http://www.galegroup.com/Times/.
PANDORA, AUSTRALIA'S WEB ARCHIVE, AND THE DIGITAL ARCHIVING SYSTEM THAT SUPPORTS IT
MARGARET E. PHILLIPS
DIRECTOR DIGITAL ARCHIVING, NATIONAL LIBRARY OF AUSTRALIA
THE ROLE OF NATIONAL LIBRARIES IN WEB HARVESTING
The primary role of a national library is to develop and maintain comprehensive collections of documentary material relating to its country and its people, to preserve them and to make them available for research now and in the future. During the last ten years, national libraries and other 'memory' institutions have had some difficult decisions to make regarding the extension of this role to electronic formats.
National libraries have accepted responsibility for online publications and a small number have embarked on programmes to develop collections, or 'archives', of them. Having accepted the responsibility, proceeding with it has been
National Library of Australia
The Times Digital Archive, 2003
HATII (UofGlasgow), Seamus Ross, 2003