DigiCULT 27
By Joost van Kasteren
Building digital collections of documents, imag-
es or music demands a lot of effort in analysing these
`objects', assigning metadata and designing ways of
accessing them. As the amount of electronic informa-
tion is growing exponentially, the build-up of digit-
al collections has not kept pace. `Why don't we do it
the other way around,' says Andreas Rauber, `and put
objects in an information space, like a librarian puts
books on a certain topic on the same shelf. But then
Rauber is associate professor at the Vienna University
of Technology, Department of Software Technology
and Interactive Systems. One of the projects he
is involved in is the SOMLib Digital Library, a library
that automatically clusters documents by content.
Rauber: `Retrieving useful information from digital
collections is rather difficult. You have to formulate a
query, specifying large numbers of keywords, and the
result is often a list of thousands of documents, both
relevant and irrelevant. Compare this with a conven-
tional library or a large bookshop and the way we
approach the stored information. Books are organised
by topic into sections, so we can locate them easily. In
doing so, we also see books on related topics, which
might be useful for us.'
The conventional library system inspired Rauber
and his co-workers to develop a digital library along
the same lines, based on the self-organising map, the
SOM in SOMLib. The SOM is an unsupervised neural
network that can automatically structure a document
collection. It does so by converting full-text
documents into vectors and then clustering them according to
content. Rauber: `Our SOM does not focus on spe-
cific words, so documents about river banks are
separated from documents on banks as financial insti-
tutions and from documents about banks to sit on.'
Being a neural network, it has to be trained by pre-
senting input data in random order. Gradually the
neural network is fine-tuned until it clusters most of
the documents correctly. Rauber: `Sometimes you get
a strange result. One time we presented the text of a
stage play to our SOM and it ended up in the news
section. When we took a closer look, it turned out that
it was a very realistic play.'
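
The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SOMLib implementation: the term-frequency vectors, the 2x2 grid, the learning-rate and radius schedules, and the names `vectorise`, `best_unit` and `train_som` are all invented for the example.

```python
import math
import random
from collections import Counter


def vectorise(texts):
    """Turn each text into a length-normalised term-frequency vector."""
    tokenised = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenised for w in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def best_unit(weights, vec):
    """Return the grid position whose weight vector is closest to vec."""
    def sq_dist(pos):
        return sum((w - x) ** 2 for w, x in zip(weights[pos], vec))
    return min(weights, key=sq_dist)


def train_som(vectors, rows=2, cols=2, epochs=50, seed=0):
    """Train a small SOM; all schedule constants are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs              # shrinks towards 0 over time
        rate = 0.5 * frac                        # learning rate
        radius = max(rows, cols) / 2 * frac + 0.5  # neighbourhood radius
        order = vectors[:]
        rng.shuffle(order)                       # present input in random order
        for vec in order:
            br, bc = best_unit(weights, vec)
            for (r, c), w in weights.items():
                dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                # Pull the unit (and, more weakly, its neighbours) towards vec.
                for i in range(dim):
                    w[i] += rate * influence * (vec[i] - w[i])
    return weights
```

After training, `best_unit(weights, vec)` gives the map position of a document. Because each vector reflects every word in the text rather than one keyword, a document about river banks and one about bank loans have different vectors and can end up on different units.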
To handle large document collections they have
developed a hierarchical extension, the GHSOM
(Growing Hierarchical SOM), which results in some-
thing like an atlas, says Rauber. `An atlas of the world
contains maps of the continents, of the countries on
these continents and of regions and sometimes even
cities. In a GHSOM you go from an overview of the
different sections to an overview of the different compartments
in a section to the topics within that compartment.'
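
The atlas idea can be pictured as a tree of maps that a reader descends level by level. The sketch below shows only that navigation structure, not the GHSOM training algorithm that decides where the hierarchy grows; the `MapNode` class, its methods and the section labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MapNode:
    """One map in the atlas: a label, its documents and its more detailed sub-maps."""
    label: str
    documents: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def add(self, path, document):
        """File a document under a path such as ["news", "sports"]."""
        node = self
        for label in path:
            node = node.children.setdefault(label, MapNode(label))
        node.documents.append(document)

    def overview(self):
        """List the areas visible at this level of detail."""
        return sorted(self.children)

    def descend(self, path):
        """Zoom in along a path and return the map found there."""
        node = self
        for label in path:
            node = node.children[label]
        return node
```

Calling `overview()` on the root corresponds to the map of continents; each `descend` step zooms in by one level.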
The search has been made easier by adding a user
interface, LibViewer, which combines the spatial
organisation of documents by SOMLib with a graphi-
cal interpretation of the metadata, based on Dublin
Core. Documents are assigned a `physical' template
such as hard cover or paper, plus further metadata such as
language, time last referenced and document size.
The metadata have a graphical metaphor so it seems
as if you are standing in front of a bookcase, with all
kinds of different books. Some thick, some thin, some
green, some yellow and some look as if they have
been extensively used, and others look brand new.
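
A rough sketch of how such a metaphor could be computed follows. Which metadata field drives which visual property is an assumption here, as are the thresholds, the colour palette and the names `Document` and `book_appearance`; the article does not specify LibViewer's actual mapping.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    language: str
    size_kb: int
    last_referenced: date
    binding: str = "paper"


# Hypothetical palette: one spine colour per language code.
LANGUAGE_COLOURS = {"en": "green", "de": "yellow"}


def book_appearance(doc, today):
    """Translate document metadata into the look of a book on the shelf."""
    days_idle = (today - doc.last_referenced).days
    return {
        "binding": doc.binding,
        "colour": LANGUAGE_COLOURS.get(doc.language, "grey"),
        "thickness": "thick" if doc.size_kb >= 100 else "thin",  # assumed cut-off
        "dusty": days_idle > 365,  # assumed: untouched for over a year
    }
```

Keeping the mapping in one small function makes it easy to swap in a different metaphor without touching the document records.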
The SOM can also be used for `objects' other
than documents. Rauber has developed the SOM-enhanced
Jukebox (SOMeJB), which is built upon the
same principles. It is used to organise pieces of music,
for instance mp3 files, according to sound character-
istics. Rauber: `The result is a music collection organ-
ised in genres of similar music. So you get one cluster
of heavy metal, one of classical music and one of Lat-
in music. It differs from standard repositories of music
in that you do not have to access it through text metadata,
such as the name of the artist or the title of the song. You
might use it even to search for songs of a certain gen-
re on the Internet.'
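
Finding music by sound rather than by text metadata can be illustrated with a small example. The two descriptors used here (RMS energy and zero-crossing rate) are crude stand-ins chosen for brevity, not the sound characteristics SOMeJB actually extracts, and the names `sound_features`, `most_similar` and `tone` are hypothetical.

```python
import math


def sound_features(samples):
    """Describe a signal by two crude numbers: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(samples) - 1))


def most_similar(query, library):
    """Return the name of the library track whose features are closest to the query."""
    qf = sound_features(query)
    return min(library, key=lambda name: math.dist(qf, sound_features(library[name])))


def tone(freq, amp, n=800, rate=8000):
    """Synthetic sine tone, used only to try the functions without audio files."""
    return [amp * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]
```

A query such as `most_similar(tone(110, 0.25), {"a": tone(100, 0.2), "b": tone(2000, 0.9)})` compares signals only, so no artist name or song title is needed. The same feature vectors could be fed to a SOM to obtain clusters of similar-sounding pieces.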