DigiCULT.Info 48
parser has missed any names, the researcher can request a view of the text in which the tagged names are highlighted.
It will be possible to add names that were not found by the parser by marking them in this text view. This method proved to be efficient: our first test yielded an accuracy of 79.4% for the tokens that the parser suggested were names. Very few names went unrecognised by the parser: only five from a total of 61,152 in the test text.
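For readers who want to compute figures of this kind on their own material, the following Python sketch shows one way to do it. It reads the reported accuracy as the share of the parser's suggestions that the researcher confirms; that reading, the function names and the counts in the first call are assumptions made for illustration only. Only the figures 5 and 61,152 come from the test described above.

```python
def confirmed_share(confirmed, suggested):
    """Percentage of parser suggestions that the researcher confirmed as names."""
    if suggested <= 0:
        raise ValueError("suggested must be positive")
    return 100.0 * confirmed / suggested


def missed_share(missed, total):
    """Percentage of the total that the parser failed to suggest."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * missed / total


# Hypothetical counts, chosen only to show the calculation.
print(confirmed_share(confirmed=8, suggested=10))

# Figures reported for the first test: five misses from a total of 61,152.
print(round(missed_share(missed=5, total=61152), 4))
```

Keeping the two measures separate matters here: the first describes how much unmarking the researcher has to do, the second how much has to be added by hand in the text view.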
When the researcher has decided the tagging is sufficient, s/he can turn to the tag view and take the set of tagged types and tokens as input for further analysis. In this respect, little has been done so far. One of the first things we plan to do is to let the researcher group name types by normalised word-forms, for instance to bring nominatives and genitives together (e.g. Albert and Albert's). Furthermore, we would like to add the results of our deeper knowledge about the names in the text, for instance by marking certain names as ‘speaking’ when quotes from the text under investigation can prove that the author has made use of the etymological meaning of the name to ‘build’ the character or the place he or she is describing. We also want to be able to get results for the
their algorithms, texts and analysis results. As such, researchers will be able to combine and cross-examine their individual result trees for further cooperative analysis.
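Combining and cross-examining result trees could look roughly like the Python sketch below. The XML layout (a `<names>` root with `<name>` elements carrying `type` and `status` attributes) and the function names are hypothetical, invented for illustration; the article does not specify Autonom's actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical result trees from two researchers analysing the same text.
TREE_A = """<names>
  <name type="Albert" status="approved"/>
  <name type="Anna" status="approved"/>
  <name type="Then" status="rejected"/>
</names>"""

TREE_B = """<names>
  <name type="Albert" status="approved"/>
  <name type="Anna" status="rejected"/>
  <name type="Berlin" status="approved"/>
</names>"""


def approved_types(xml_text):
    """Return the set of name types marked as approved in one result tree."""
    root = ET.fromstring(xml_text)
    return {el.get("type") for el in root.iter("name")
            if el.get("status") == "approved"}


def cross_examine(xml_a, xml_b):
    """Compare two trees: names both approved, and names only one side approved."""
    a, b = approved_types(xml_a), approved_types(xml_b)
    return {"agreed": a & b, "only_a": a - b, "only_b": b - a}


report = cross_examine(TREE_A, TREE_B)
for key in ("agreed", "only_a", "only_b"):
    print(key, sorted(report[key]))
```

The names on which the two trees disagree are the natural starting point for the cooperative analysis mentioned above.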
HOW DOES AUTONOM WORK?
To analyse the names in a (literary) text, a researcher provides scanned and OCRed text. As the goal is not to present an ideal and ‘clean’ digital version of the text, the researcher can leave the OCR text uncorrected. The .txt version of the digital text must then be uploaded to the repository; this is very easy to do and all procedures are carefully explained on the Website. To avoid problems with copyright, we have taken care to ensure that only the researcher who submits a document will be able to analyse and view it.
The next step is to analyse the uploaded text. The name parser presents the researcher with a list of all types in the text that might be names. In this list, which also shows the number of occurrences, the researcher can view all tokens per type. Each word that is probably a name is presented with some context, so that it is very easy to check whether all of the parser's suggestions are correct.
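As a rough illustration of the kind of list described above, the following Python sketch collects candidate name types with their occurrence counts and a short context for each token. The heuristic used here (a capitalised word that does not start a sentence) is an assumption made purely for illustration, not Autonom's parsing algorithm, and the function name `candidate_names` is hypothetical.

```python
import re
from collections import defaultdict


def candidate_names(text, width=20):
    """Map each candidate name type to a list of (offset, context) tokens.

    Assumed heuristic (not Autonom's parser): a capitalised word counts as
    a candidate unless it is the first word of the text or directly
    follows sentence-ending punctuation.
    """
    candidates = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+(?:'s)?", text):
        word = match.group()
        if not word[0].isupper():
            continue
        before = text[:match.start()].rstrip()
        if not before or before[-1] in ".!?":
            continue  # sentence-initial capital: skipped by this heuristic
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        candidates[word].append((match.start(), text[start:end]))
    return dict(candidates)


sample = "The old man greeted Albert. Then Albert's dog barked at Anna and Albert."
for name_type, tokens in sorted(candidate_names(sample).items()):
    print(name_type, len(tokens))
    for offset, context in tokens:
        print("   ", offset, repr(context))
```

A heuristic this simple both over-suggests (any capitalised word in mid-sentence) and misses names that open a sentence, which is why showing each token in context for the researcher to confirm or reject is essential.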
Clicking the save button fixes the chosen interpretation of the token(s) as a name or not. To check whether the
Although at the moment we only provide an algorithm for parsing proper names, the contours of the framework are beginning to form. A researcher may upload a text file to a personalised and private repository. From the repository the text may then be analysed by applying an algorithm of the researcher's choice. The parsing algorithm returns an XML tree representing the tokens in the text identified as ‘probable’ proper names. The researcher can approve or reject the suggestions made by the algorithm and annotate (i.e. tag) analysed items. The ‘authorised’ results are stored in a private analysis repository. From this repository result trees may be retrieved for future reference, for editing, or for rendering a view on the combined result trees and analysed text. For these latter purposes we rely on a ‘just in time trees’ model (JITT) [48] that allows us to preserve the original digital text and to store multiple result trees in a concurrent versioning system (CVS). By realising concurrent versioning for analysis results we intend to provide the researcher with the possibility to recursively refine the formal definition (and thereby the algorithm) for the textual phenomena sought within texts. The researcher should in this way be able to reproduce, refine and combine any prior algorithms used for analyses. In addition, we want to provide collaborative means for researchers to share
In this presentation all tokens by default have been marked as ‘proper name’, so the researcher only has to unmark the wrongly suggested words.
48 For more information about Just in Time Trees, see http://www.sbl-site2.org/Extreme2002/.
Autonom, 2003