DigiCULT.Info 23
article on the page and the OCR process
provides the co-ordinates of every word
within each clip or article. In this way
articles can be extracted 'on the fly' from
full-page images and the selected search
terms highlighted on the delivered image.
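The idea of clipping an article from the full page and highlighting search terms from stored word co-ordinates can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production system: the `Box` and `Word` types, the function `highlights_for_clip` and the sample co-ordinates are all hypothetical, and co-ordinates are assumed to be pixel positions on the full-page image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in full-page pixel co-ordinates (assumed layout)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this rectangle."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)

    def relative_to(self, origin: "Box") -> "Box":
        """Translate this rectangle so that `origin`'s top-left corner is (0, 0)."""
        return Box(self.left - origin.left, self.top - origin.top,
                   self.right - origin.left, self.bottom - origin.top)


@dataclass(frozen=True)
class Word:
    """One OCRed word and its position on the page."""
    text: str
    box: Box


def highlights_for_clip(article: Box, words: list[Word], terms: set[str]) -> list[Box]:
    """Return highlight rectangles, relative to the article clip, for matching words."""
    wanted = {t.lower() for t in terms}
    return [
        w.box.relative_to(article)
        for w in words
        if article.contains(w.box)
        and w.text.lower().strip(".,;:!?'\"") in wanted
    ]


# Hypothetical page: one article region and three OCRed words.
article = Box(100, 200, 900, 1400)
words = [
    Word("Revolution,", Box(150, 260, 330, 290)),
    Word("stagecoach", Box(350, 260, 520, 290)),
    Word("Revolution", Box(1000, 260, 1180, 290)),  # lies outside the article clip
]
result = highlights_for_clip(article, words, {"revolution"})
print(result)
```

Because the article rectangle and the word rectangles share the page's co-ordinate system, only the full-page image needs to be stored: the clip is a crop of the page, and each highlight is a word box shifted by the clip's top-left corner.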
This approach obviates the need to store
both full pages and clips on the image
servers. The images, clipped images and
XML containing the OCRed text and
parameter-driven tools that we
have developed in-house. Pages
that were badly cropped prior
to microfilming are manually
cleaned and the title and page
number electronically inserted.
These processes not only make
the final image look more
attractive, but improve the
accuracy of OCR on the text
and reduce the overall file size
for delivery over the Internet.
Cleaned images generally range
in size between 600 KB and 1.6
MB depending on content.
Large display adverts or picture
pages with a high black content
approach the higher end of the
range, with some occasionally
reaching 3 MB in size.
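As a rough illustration of the kind of clean-up discussed in this article (gamma correction, reduction to a bi-tonal image and de-speckling), here is a minimal Python sketch. It is not the in-house tooling, whose algorithms are not described here. The function names, the fixed threshold of 128, the gamma value and the "isolated pixel" rule for speckles are all assumptions chosen for simplicity, and de-skewing, cropping and edge enhancement are left out.

```python
def gamma_correct(pixels, gamma):
    """Apply gamma correction to 8-bit greyscale values (0 = black, 255 = white)."""
    return [[round(255 * (p / 255) ** (1 / gamma)) for p in row] for row in pixels]


def to_bitonal(pixels, threshold=128):
    """Map each greyscale pixel to 1 (black ink) or 0 (white paper)."""
    return [[1 if p < threshold else 0 for p in row] for row in pixels]


def despeckle(bitmap):
    """Clear black pixels that have no black pixel among their 8 neighbours."""
    height, width = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            neighbours = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0  # an isolated black pixel is treated as noise
    return cleaned


# Hypothetical 4 x 5 greyscale patch: a two-pixel stroke and one isolated speck.
grey = [
    [230, 230, 230, 230, 230],
    [230,  20,  20, 230, 230],
    [230, 230, 230, 230, 230],
    [230, 230, 230, 230,  20],
]
cleaned = despeckle(to_bitonal(gamma_correct(grey, gamma=1.5)))
for row in cleaned:
    print(row)
```

In this patch the two adjacent dark pixels survive as a stroke, while the lone dark pixel in the corner is cleared. Removing such specks is one plausible reason why cleaning both helps OCR and reduces file size, since bi-tonal compression works best on large uniform areas.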
The page images are delivered on DLT (Digital Linear Tape)
to our contractors in India for article clipping,
OCR, and the creation of fielded
metadata made up of publication name,
year, date, issue number, page number, title,
subtitle, author, column position and illustration
type. A category type is also added
(e.g. 'Letters to the Editor' or 'Sport') to
further segment the material and help end-users
better limit their searches. Clipping
the page into articles generates co-ordinates
or positional information for each
Reg Readings, Production Director of Thomson Gale and Managing Editor for The Times Index, talked DigiCULT.Info through the process of bringing 'The Thunderer' to the Web.
After two years of collective effort in
Reading (UK), India and the USA,
the digitisation of The Times from 1785 to
1985 is nearing completion. This article is
a very brief account of Thomson Gale's
(http://www.galegroup.com/) endeavours
to create a searchable backfile of the oldest
continuously published daily newspaper in
the English language. Every headline, article
and image, every front page, editorial,
birth and death notice, advertisement and
classified ad that appeared within the pages
of The Times (London) is now easily
accessible: a complete virtual chronicle of
history for this period, from the very first
stirrings of the French Revolution to the
release of Microsoft Windows, from the
reckless driving of stagecoaches to motorway
madness.
The digitisation process starts with our
two Mekel M500 greyscale scanners
and 1,250 reels of 35 mm microfilm. The
individual pages are captured in greyscale
and saved as 300 dpi bi-tonal TIFFs at a
rate of six frames a minute. By carefully
controlling the duplication of the microfilm
masters and applying our own algorithms
for gamma correction (light
balance), noise removal and edge enhancement,
we have been able to generate high-quality
images for good OCR results. The
images from the scanners are then de-skewed,
cropped and de-speckled using
DELIVERING THE TIMES DIGITAL ARCHIVE
Figure 1: Keyword Search screen for
The Times Digital Archive
Figure 2: Sample Results screen. Showing the context
in which an article was originally published (the article
is highlighted in the thumbnail to the left) came high
on the 'wish list' of project advisers.
The Times Digital Archive, 2003