The XML Family
of Technologies
hence HTML markup is more concerned with delivering and displaying content than
with representing its semantics. Purely descriptive markup, on the other hand, allows the
same document to be processed by many different software applications, and allows different
applications to use data in different ways. This ensures flexibility: information can be used
for multiple purposes. The downside is that HTML content creators have less control
over what their work will look like when it is displayed: as many readers will know, a
Web site may look very different on Netscape, Opera and Internet Explorer owing to their
fundamental differences, and variation in degrees of support for proprietary tags.³⁸
SGML introduced the notion of a document type, which must be defined at the outset.
A document is defined by its constituent parts and their hierarchical structure. Through
the use of carefully structured documents, XML allows content to be reused and repurposed
over different formats, programs, and platforms. If we know a document's type
then we know what individual parts it must (or may) contain and in what order they are
likely to appear. This allows us to use parsing programs to extract the information in
which we are interested. It is also possible to make judgements as to whether particular
documents are of a certain type, and whether other documents of the same type can be
processed in a uniform fashion.
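
As a rough illustration of the point above, the following Python sketch parses a small XML document, checks that it contains the parts required by an assumed document type in the expected order, and extracts one piece of information. The `memo` document type, its element names (`to`, `from`, `subject`, `body`) and the helper functions are invented for this example; in practice a DTD or schema and a validating parser would normally do this work rather than hand-written checks.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: a memo must contain exactly these
# child elements, in this order. The names are assumptions.
REQUIRED_ORDER = ["to", "from", "subject", "body"]

MEMO = """<memo>
  <to>Editorial team</to>
  <from>Production</from>
  <subject>Print schedule</subject>
  <body>Proofs are due on Friday.</body>
</memo>"""


def conforms_to_memo(xml_text):
    """Return True if the document has the parts of the assumed memo type, in order."""
    root = ET.fromstring(xml_text)
    if root.tag != "memo":
        return False
    return [child.tag for child in root] == REQUIRED_ORDER


def extract_subject(xml_text):
    """Extract one piece of information by element name."""
    root = ET.fromstring(xml_text)
    return root.findtext("subject")


print(conforms_to_memo(MEMO))   # True
print(extract_subject(MEMO))    # Print schedule
```

Because every document of the same type shares this structure, the same two functions could be applied unchanged to any other memo.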
XML's explicit focus on the Web has been geared to overcome the increasingly evident
shortcomings of HTML, most notably its fixed tagset and its inability to treat content
as data. This first difficulty has been exacerbated by the fact that different browser
vendors have allowed the use of proprietary tags, thus making certain Web pages very
attractive in one browser but completely unusable in another. Apart from the differentiation
between the <head> and <body> sections of a document, HTML does not support
the structuring of documents. While we may search HTML documents for occurrences
of text strings, content creators cannot define what the text we find is intended to mean.
Specific user communities may wish to formulate their own vocabularies and tagsets
for easy and usable exchange of materials between remote parties: the European
Parliament has already done so with ParlML.³⁹ The variety of possible tagset uses is not
restricted: for example, publishing companies may wish to communicate using the tag
<title> to indicate the title of a book, whereas genealogists may prefer <title> to mean
the word or words that precede a person's name. This might cause difficulties when a
book on genealogy is being published, but these problems can easily be overcome, as we
will see later in this section.
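
One widely used way of resolving such a clash is XML namespaces, which qualify each element name with a URI so that two vocabularies can share a local name such as `title`. The Python sketch below is an illustration only: the namespace URIs, the prefixes `pub` and `gen`, and the sample content are all invented for this example and do not belong to any real publishing or genealogy vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for illustration.
NS = {
    "pub": "http://example.org/publishing",
    "gen": "http://example.org/genealogy",
}

BOOK = """<pub:book xmlns:pub="http://example.org/publishing"
                   xmlns:gen="http://example.org/genealogy">
  <pub:title>Tracing Family Histories</pub:title>
  <gen:person>
    <gen:title>Dr</gen:title>
    <gen:name>Mary Smith</gen:name>
  </gen:person>
</pub:book>"""

root = ET.fromstring(BOOK)

# The same local name, "title", is looked up in two different namespaces.
book_title = root.findtext("pub:title", namespaces=NS)
person_title = root.findtext("gen:person/gen:title", namespaces=NS)

print(book_title)     # Tracing Family Histories
print(person_title)   # Dr
```

Each lookup names the namespace it wants, so the publisher's `title` and the genealogist's `title` can coexist in one document without ambiguity.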
³⁸ Originally an SGML application, HTML has recently been rewritten as an XML application called XHTML (http://www.w3.org/TR/xhtml1/).
³⁹ http://www.europarl.eu.int/docman/texts/TFDM(2000)0014EN(TOC)0.htm