DigiCULT.Info 20
and when the same tags appear multiple times. Tagging individual fields offers the advantage that the data from different fields can be addressed individually and that database filters can be eliminated. The use of abbreviations and codes as tag names can offer a partial solution, but this does mean that documentation on the meaning of tag names needs to be kept. Because of this, XML documents lose part of their autonomy and self-reliance.
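As a rough illustration of the unloading steps for a single record (conversion to Unicode, filtering of characters that XML does not allow, replacement of reserved characters by entities, tagging of individual fields and addition of the XML declaration), a minimal Python sketch might look like the following. It is not the tool developed by the DAVID project; the function names, the `cp1252` source encoding and the field names are assumptions chosen for the example.

```python
import re
from xml.sax.saxutils import escape

# Characters that XML 1.0 does not allow: control characters other than
# tab, line feed and carriage return, surrogates, and U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def clean_text(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Turn bytes from a database text dump into XML-safe character data.

    The source encoding is an assumption; use whatever the database used.
    """
    text = raw.decode(source_encoding)        # convert to Unicode
    text = _INVALID_XML_CHARS.sub("", text)   # filter out invalid characters
    return escape(text)                       # &, < and > become entities


def record_to_xml(record: dict[str, bytes], root_tag: str = "record") -> str:
    """Wrap every field in its own tag and add the XML declaration.

    The keys are used as tag names, so they must already be valid XML names
    (this sketch does not check that).
    """
    fields = "".join(
        f"  <{name}>{clean_text(value)}</{name}>\n"
        for name, value in record.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<{root_tag}>\n{fields}</{root_tag}>\n"
    )
```

Calling `record_to_xml({"surname": b"Peeters"})` yields a small document with one `<surname>` element. A semantic name such as `surname` needs no further explanation, whereas an abbreviated name such as `sn` would have to be documented separately.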
METADATA
At the moment of archiving, all of the metadata of the records and of the information system are archived as well. By adding as many semantic XML tags to the database fields and records as possible, important metadata is archived within the record itself. Because of this, there is no need for external documentation in order to find out the function of characters in certain positions within the database record. External documentation is still needed, however, to interpret abbreviated tag names and code tables. Essential metadata on the information system are present in the information system inventory. This metadata is exported to an XML document and supplemented with further metadata at the moment when an archiving action is undertaken. This XML document is archived at the same time as the records.
The DAVID project highlights the importance of understanding the overall structure of an information system in order to design methods of record keeping which are appropriate to the data stored and which will maintain high quality of both metadata and records for the long term.
The migration of records from databases to XML documents consists of several steps. The first step involves drawing up a document model for the records. Initially, DTDs were developed for this, but gradually the switch is being made to XML schemas. This document model is based mainly on the inherent structure of the documents. It can be identical to the internal database record structure, but this is not essential. In relational databases the record is often spread over several tables, so joins and queries usually precede the unload or export of the data. One guideline when putting together a document model is to consider the way in which the input, and especially the output, was presented to the user of the active information system.
The mapping of a relational data model to a hierarchical document model is not always obvious, because the two data models differ in a number of fundamental ways. The internal logic of the records can be carried over into the archived documents through careful nesting and semantic tags, which is why exporting data from the database often requires query and join actions. If necessary, stylesheets can be used to demonstrate more explicitly the way in which the records were shown to the users of the active system.
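The step from joined relational rows to a nested document can be sketched as follows. This is an illustration only: the `permit` and `decision` tables, their columns and the element names are hypothetical and do not come from the DAVID project, and SQLite stands in for whatever database the active system used.

```python
import sqlite3
import xml.etree.ElementTree as ET


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database with one record spread over two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE permit (id INTEGER PRIMARY KEY, applicant TEXT);
        CREATE TABLE decision (id INTEGER PRIMARY KEY, permit_id INTEGER, title TEXT);
        INSERT INTO permit VALUES (1, 'Peeters'), (2, 'Janssens');
        INSERT INTO decision VALUES (1, 1, 'Granted'), (2, 1, 'Amended');
        """
    )
    return conn


def export_permits(conn: sqlite3.Connection) -> ET.Element:
    """Join the tables and nest each permit's decisions inside the permit."""
    root = ET.Element("permits")
    current_id, permit = None, None
    rows = conn.execute(
        """
        SELECT p.id, p.applicant, d.title
        FROM permit AS p
        LEFT JOIN decision AS d ON d.permit_id = p.id
        ORDER BY p.id, d.id
        """
    )
    for permit_id, applicant, title in rows:
        if permit_id != current_id:
            # First row of a new permit: open a new parent element.
            permit = ET.SubElement(root, "permit", id=str(permit_id))
            ET.SubElement(permit, "applicant").text = applicant
            current_id = permit_id
        if title is not None:
            # Rows repeated by the join become nested child elements.
            ET.SubElement(permit, "decision").text = title
    return root


if __name__ == "__main__":
    print(ET.tostring(export_permits(demo_connection()), encoding="unicode"))
```

The join repeats the permit data once per decision; the loop folds those repeated rows back into one `permit` element with several `decision` children, which is the nesting that a hierarchical document model expects.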
The process of unloading databases through text files deserves special attention because of the encoding of the characters. First, the characters are converted, preferably to Unicode. The next step is to replace the reserved XML characters by entities and to filter out the invalid XML characters (for instance control characters). This is achieved with a tool specifically developed for the purpose. Finally, the last step is tagging the data and adding the XML declarations. When choosing the tag names it is advisable to choose semantic tags, even though this can lead to redundancy when using large files
the archival criteria, but also on the technological demands to reconstruct the records in an authentic way in the future.
The decision on the file format and the medium that will be used is made at the moment of record-keeping. The city archive of Antwerp has set its archiving standards to the file formats, media and file systems recommended in official guidelines (which are based on the general guidelines and best practices of the DAVID project). The creator and the IT department of the city prepare the transfer of the records together. The records that have been deposited and their carriers are inspected on arrival at the archival service. To validate large XML documents, a validation parser programme was developed. CD quality is tested with the aid of a diagnostic tool; if the deposit does not meet the quality demands set by the city archives, the electronic records are sent back to the creator so that they can be brought up to the appropriate level. Two copies of each carrier must be deposited: one copy is kept at the archival service while the safety (backup) copy is stored at a separate location. Only once the transfer has been approved is it permissible to remove the records from the original information system.
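The kind of check performed on arrival can be hinted at with a streaming parser, which reads a large XML document piece by piece instead of loading it into memory. The sketch below is not the validation parser developed for the city archive, and it only tests well-formedness; validation against a DTD or XML schema would need an additional library. The function name is an assumption made for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


def check_well_formed(source) -> tuple[bool, str]:
    """Stream through an XML document and report whether it is well-formed.

    `source` may be a file name or an open binary stream. The default
    ContentHandler ignores all content, so nothing is kept in memory.
    """
    try:
        xml.sax.parse(source, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return False, str(exc)   # message includes line and column
    return True, ""


if __name__ == "__main__":
    ok, message = check_well_formed(io.BytesIO(b"<records><record/></records>"))
    print(ok, message)
```

A deposit that fails such a check would be returned to the creator before any further inspection takes place.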
XML PRESERVATION OF RECORDS
The eXtensible Markup Language (XML) is being used as much as possible as the preservation format when archiving databases with textual data. XML offers interesting benefits for electronic record-keeping: it is easily exchangeable, appropriate for structured textual information, applies an explicit document model, and is self-describing to a large extent. For databases that contain Binary Large Objects (BLOBs: large blocks of data, such as image or sound files), XML is used as the metadata format for archived records.
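For records that contain BLOBs, the binary file is kept as it is and an XML document describes it. A minimal sketch of such a description follows; the element names, the choice of fields and the SHA-256 checksum are assumptions for illustration and are not taken from the DAVID guidelines.

```python
import hashlib
import xml.etree.ElementTree as ET


def describe_blob(file_name: str, data: bytes, mime_type: str) -> ET.Element:
    """Build an XML metadata element for one binary object.

    All element names here are hypothetical examples.
    """
    meta = ET.Element("binaryObject")
    ET.SubElement(meta, "fileName").text = file_name
    ET.SubElement(meta, "mimeType").text = mime_type
    ET.SubElement(meta, "sizeInBytes").text = str(len(data))
    # A checksum lets a later reader confirm the file is unchanged.
    checksum = ET.SubElement(meta, "checksum", algorithm="SHA-256")
    checksum.text = hashlib.sha256(data).hexdigest()
    return meta


if __name__ == "__main__":
    element = describe_blob("plan.tif", b"\x00\x01\x02\x03", "image/tiff")
    print(ET.tostring(element, encoding="unicode"))
```

The binary content itself never enters the XML document; only its description does, so the metadata stays readable as plain text.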
More information on this archiving system and on the DAVID project is available on the Website: http://www.antwerpen.be/david or by emailing: david@stad.antwerpen.be