DigiCULT.Info
We propose two complementary ways of producing these annotations: automatically, with document image analysis, and collectively, on the Internet, with the help of readers as they use the images. We present a platform developed to manage collective annotations built on top of automatic annotations. We show application examples on various documents: civil status registers, nineteenth-century military forms and naturalisation decrees. For each document we present the automatic annotations that we are able to produce with DMOS (Description and MOdification of Segmentation), a generic document recognition method we developed.[12] We also present the collective annotations that users can add with the help of the automatic annotations. This platform offers a uniform interface for accessing archival documents by content.
PRODUCING ANNOTATIONS
Automatic Annotations
Automatic annotations are produced by optical character recognition (OCR). On recent printed documents, existing OCR systems are able to recognise almost all of the text, which can then be used to build textual annotations for document retrieval. On old archival documents this is more difficult: documents can be damaged (tears, blots, tape repairs, smudges), ink on the reverse side of the paper may bleed through to the front of the page, stamps may have been affixed, and many other features also create problems for OCR. To produce annotations automatically on poor-quality documents containing handwritten text, written by many different writers, over a large vocabulary and without dictionaries, it is necessary first to locate where the information needed for retrieval lies within the image. This allows us to detect which parts of the image contain handwriting and to begin work on handwriting recognition.
Geometric Annotations
To be able to detect the position of specific handwritten text, it is important that the document is structured, as forms and tables are. We can also work on less structured documents if the handwritten text is graphically structured, e.g. confined within margins. The literature shows that structured document recognition systems are difficult to develop and are usually specific to one kind of document. A new kind of document often means developing a completely new recognition system. This is why we proposed DMOS, a generic recognition method for structured documents. This method involves a new grammatical formalism, Enhanced Position Formalism (EPF), which can be seen as a description language for structured documents. EPF makes possible a graphical, syntactical or even semantic description of a document. DMOS contains the associated parser, which is able to change the parsed structure during parsing. This allows the system to try other segmentations, with the help of context, to improve recognition. DMOS is thus an automatic generator of structured document recognition systems. Adaptation to a new kind of document is achieved simply by describing the document in an EPF grammar. This grammar is then compiled to produce a new structured document recognition system. With this generator, we have been able to produce various document recognition systems: for musical scores, mathematical formulae and recursive table structures. We could even use it to recognise tennis courts in videos!
Later we present the application of DMOS to various documents to automatically produce geometric annotations on the document structure.
Annotations on Handwritten Text
Once handwritten text has been located within the document with DMOS, it is possible to analyse the handwriting. As we are generally interested in last names, it is impossible to use dictionaries, because they cannot be exhaustive. Moreover, these names are written by many different writers. It is therefore impossible to use handwriting recognition methods for large vocabularies,[13] or even word spotting.[14] We propose instead to extract, by image analysis, a grapheme representation of handwritten names (see Figure 1). This representation is stored as textual annotations.
This produces an automatic index of handwritten names. A user can make a textual request, which is translated into a grapheme representation. Using edit distance, this representation is compared with all the index entries to select the names that match the request.
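The retrieval step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in this article: the grapheme codes, the `text_to_graphemes` translation (which simply treats each letter as one placeholder grapheme), the `search` function with its `max_distance` threshold and the index contents are all hypothetical, since the actual representation is produced by image analysis. Only the comparison by edit distance follows the description above.

```python
from typing import Dict, List, Sequence, Tuple


def text_to_graphemes(text: str) -> List[str]:
    """Hypothetical stand-in for the request translation.

    Assumption: each letter becomes one placeholder grapheme code.
    A real system would map letters to the graphemes found by image analysis.
    """
    return [ch for ch in text.lower() if ch.isalpha()]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of grapheme codes."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (symbol_a != symbol_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def search(
    request: str,
    index: Dict[str, List[str]],
    max_distance: int = 2,
) -> List[Tuple[int, str]]:
    """Return (distance, entry id) pairs within max_distance, closest first."""
    target = text_to_graphemes(request)
    scored = ((edit_distance(target, graphemes), entry_id)
              for entry_id, graphemes in index.items())
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    # Toy index: entry ids and grapheme sequences are invented for the example.
    toy_index = {
        "form_001": list("martin"),
        "form_002": list("marlin"),
        "form_003": list("dupont"),
    }
    for distance, entry_id in search("Martin", toy_index):
        print(entry_id, distance)
```

With this toy index, the request "Martin" selects form_001 at distance 0 and form_002 at distance 1, while form_003 is rejected. Ranking by distance rather than requiring an exact match is what lets an index built from imperfect automatic extraction still return useful candidates.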
Later we present the first results of automatic access to handwritten names in military forms from the nineteenth century.

[12] See also: B. Coüasnon, 'DMOS: A generic document recognition method, application to an automatic generator of musical scores, mathematical formulae and table structures recognition systems' in ICDAR, International Conference on Document Analysis and Recognition, pages 215–220, Seattle, USA, September 2001, and B. Coüasnon, 'Dealing with noise in DMOS, a generic method for structured document recognition' in IAPR International Workshop on Graphics Recognition, Barcelona, Spain, July 2003.

[13] Such as is described in A. Vinciarelli, S. Bengio and H. Bunke, 'Offline recognition of large vocabulary cursive handwritten text' in Proceedings of the 7th International Conference on Document Analysis and Recognition, volume 1, pages 1101–1105, Edinburgh, Scotland, August 2003.

[14] See R. Manmatha and W. B. Croft, 'Word spotting: Indexing handwritten archives' in M. Maybury, editor, Intelligent Multi-media Information Retrieval Collection. AAAI/MIT Press, 1997, and T. M. Rath and R. Manmatha, 'Word image matching using dynamic time warping' in Proceedings of the Conference on Computer Vision and Pattern Recognition, volume 2, pages 521–527, Madison, USA, June 2003.
Figure 1: Automatic extraction of graphemes. Dots represent the beginning of each grapheme.
© IMADOC, 2004