DigiCULT.Info 47
combined with our expectation that, if those words really are names, they will
(with a certain probability) not appear in the same text without the first
character capitalised. So we expect to find 'Simon' in a text, but not 'simon',
but we do expect to find 'Table' as well as 'table'. Therefore, to identify all
probable proper names in a text under investigation, we simply return all tokens
that have an initial capital but do not appear in the same text uncapitalised.
With the Autonom framework we intend to provide (and in part already do provide)
an online platform for individual researchers to express such rule-like features
or phenomena in texts. The platform will convert formalised textual features
described by the researcher into rule-based algorithms for text parsing. Such
algorithms may then be applied to any text available to the researcher within
a repository.
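
The capitalisation rule described above can be illustrated with a minimal Python sketch. It is not Autonom's actual implementation: the function name probable_names and the simple regular-expression tokeniser are assumptions made for illustration only.

```python
import re

# Assumed tokeniser (hypothetical, not taken from Autonom): a token is a run
# of letters, optionally joined by internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def probable_names(text):
    """Return tokens that have an initial capital but never occur
    in the same text with that first character in lower case."""
    tokens = TOKEN_RE.findall(text)
    # All tokens that start with a lower-case letter.
    lowercase_seen = {t for t in tokens if t[0].islower()}
    # All tokens that start with a capital letter.
    candidates = {t for t in tokens if t[0].isupper()}
    # Keep a candidate only if its uncapitalised form is absent from the text.
    return sorted(
        t for t in candidates
        if (t[0].lower() + t[1:]) not in lowercase_seen
    )


sample = "Simon put the book on the table. Table legs creaked, and Simon sighed."
print(probable_names(sample))
```

In this sample 'Table' is rejected because 'table' also occurs, while 'Simon' is kept. A word that happens to appear only at the start of sentences would be reported as a name by this sketch, which is one reason the rule yields probable rather than certain names.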
Without the help of a name parser this would be a time-consuming and
error-prone job, which may explain why this method has not been used much until
now. The name parser we have developed and the Web site through which it can be
used will make this type of name research a lot easier. We expect it will
eventually yield much more concrete and reliable results than onomastic research
of this type has done to date. Here we will describe the first steps we have
taken towards a Web site in which researchers from all over the world can
analyse names and other features in the texts of their own choice.
TECHNICAL BACKGROUND
Autonom (http://autonom.niwi.knaw.nl) is a Web application that provides a
framework for computer-assisted analysis of texts. Its special focus is literary
texts. At the moment Autonom does not implement any statistical machine-learning
approaches, because such approaches require large training corpora that may not
always be available or appropriate to the literary field in question. Rather, we
adopted rule-based (or finite-state) shallow parsing techniques as a pragmatic
solution. We do, however, intend to strengthen this technique by adding
probabilistic feature parsing. We will use the growing number of texts that are
researched within the Autonom framework as a training corpus.
First, we developed the name parser. The rule we use within the parser to
identify proper names is our knowledge of the fact that in Dutch it is customary
to write proper names with an initial capital,
KARINA VAN DALEN-OSKAM & JORIS VAN ZUNDERT
NETHERLANDS INSTITUTE FOR SCIENTIFIC INFORMATION SERVICES (NIWI-KNAW)
DEPARTMENT OF DUTCH LINGUISTICS AND LITERARY STUDIES
INTRODUCTION
One of the research topics of the Department of Dutch Linguistics and Literary
Studies of the Netherlands Institute for Scientific Information Services
(NIWI-KNAW) is 'names in literary texts'. Names can be studied from different
perspectives: from linguistic, literary or onomastic ('name studies') points of
view. Areas for investigation include: how authors use names to distinguish
between entities (persons, places, etc.); which types of names are used and what
their frequencies are; how authors make use of names to give their story a
certain atmosphere or to describe the characters or places in their work; and
how authors differ from each other in their use of names. Central to this kind
of research is to consider all names in the texts under investigation in the
analysis.
FINDING NAMES WITH THE AUTONOM PARSER
First page of the novel De Middelburgsche
Avanturier (1760, author unknown), one of the
eighteenth-century novels that will be analysed
with [the] help of Autonom.
NIWI-KNAW, 2003
Autonom, 2003