
Parsing the PROIEL syntactic formalism for historical treebanks

In the context of the project Pragmatic Resources for Old Indo-European Languages (University of Oslo, 2008-2012), a formalism and guidelines have been developed for the annotation of parts of speech, morphology and dependency syntax in several historical Indo-European languages, such as Ancient Greek, Latin, Old Church Slavonic and Gothic. This formalism and the guidelines derived from it have by now been used for 18 different languages or language stages. These include Old Swedish, as part of the project Methods for the automatic Analysis of Text in digital Historic Resources, which currently runs at the Department of Swedish at GU.

Syntactic annotation comes in the form of dependency trees, supplemented by so-called secondary edges to capture argument sharing and by empty tokens to capture ellipsis of verbs and coordinators. As far as we are aware, no one has attempted to statistically parse the PROIEL format. In this project, you will investigate how to parse this format and construct a working setup so that anyone with a PROIEL corpus can train a parser on their annotated data.
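To make the structure described above concrete, here is a minimal Python sketch of a sentence with primary edges, one secondary edge and a flag for empty tokens. The class, the field names and the relation labels (`pred`, `sub`, `xobj`, `xsub`) are illustrative assumptions for this example, not the PROIEL XML schema. The sketch shows that the primary edges alone form a tree, while adding the secondary edge gives one token two incoming edges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    id: int
    form: str                     # "" for an empty (elided) token
    head: Optional[int]           # id of the primary head, None for the root
    relation: str                 # label of the primary edge
    is_empty: bool = False        # True for tokens inserted to model ellipsis
    # Secondary edges leaving this token, as (target id, label) pairs.
    secondary: List[Tuple[int, str]] = field(default_factory=list)


def primary_edges_form_tree(tokens: List[Token]) -> bool:
    """Return True if the primary edges give a single-rooted, acyclic tree."""
    by_id = {t.id: t for t in tokens}
    if sum(1 for t in tokens if t.head is None) != 1:
        return False
    for t in tokens:
        seen = set()
        cur = t
        # Walk up to the root; fail on a cycle or a dangling head id.
        while cur.head is not None:
            if cur.id in seen or cur.head not in by_id:
                return False
            seen.add(cur.id)
            cur = by_id[cur.head]
    return True


def incoming_edge_counts(tokens: List[Token]) -> Dict[int, int]:
    """Count incoming edges per token, primary and secondary together."""
    counts = {t.id: 0 for t in tokens}
    for t in tokens:
        if t.head is not None:
            counts[t.id] += 1
        for target, _label in t.secondary:
            counts[target] += 1
    return counts


# Hypothetical control construction: the subject of "wants" is also the
# understood subject of "sleep", recorded as a secondary edge from token 3.
sentence = [
    Token(1, "he", 2, "sub"),
    Token(2, "wants", None, "pred"),
    Token(3, "sleep", 2, "xobj", secondary=[(1, "xsub")]),
]

print(primary_edges_form_tree(sentence))
print(incoming_edge_counts(sentence))
```

Token 1 ends up with two incoming edges once the secondary edge is counted, which is exactly what a parser restricted to trees cannot produce directly.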

A fair amount of annotated material is available for training parsers, a selection of which can be used in the MA project. In total, well over a million tokens carry some kind of PROIEL annotation, and the largest languages each have over 100k annotated tokens.

An important part of the project is to investigate how to tackle the fact that the PROIEL syntactic format is not a classic dependency tree. Are there existing statistical parsers that can handle a format like PROIEL's? Can you adjust or extend an existing parser? Or could you handle the PROIEL format with a standard parser and some pre- and post-processing? You will also evaluate your final solution's performance on some of the existing annotated material. The historical material has other features that make parsing challenging, such as non-standard orthography, a lack of uniform sentence marking and morphologically rich languages, but these issues are not intended to be the focus of the project.
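As an illustration of the pre- and post-processing option, the following Python sketch folds secondary edges into the primary relation label before training, so that a standard tree parser only sees a plain tree, and unfolds them again after parsing. This is one possible encoding chosen for the example, not an established PROIEL method; the `Tok` type, the `|` separator, the relative-offset scheme and the relation labels are all assumptions. Empty tokens are not handled here.

```python
from typing import List, NamedTuple, Optional, Tuple


class Tok(NamedTuple):
    id: int
    form: str
    head: Optional[int]           # primary head id, None for the root
    relation: str                 # primary edge label
    # Secondary edges leaving this token, as (target id, label) pairs.
    secondary: Tuple[Tuple[int, str], ...] = ()


SEP = "|"  # hypothetical separator, assumed not to occur in relation labels


def encode(tokens: List[Tok]) -> List[Tok]:
    """Pre-processing: move secondary edges into the primary label."""
    out = []
    for t in tokens:
        label = t.relation
        for target, sec_label in t.secondary:
            # Store the target as a signed offset relative to this token.
            label += f"{SEP}{sec_label}:{target - t.id:+d}"
        out.append(Tok(t.id, t.form, t.head, label))
    return out


def decode(tokens: List[Tok]) -> List[Tok]:
    """Post-processing: recover secondary edges from enriched labels."""
    out = []
    for t in tokens:
        relation, *extras = t.relation.split(SEP)
        secondary = []
        for extra in extras:
            sec_label, offset = extra.rsplit(":", 1)
            secondary.append((t.id + int(offset), sec_label))
        out.append(Tok(t.id, t.form, t.head, relation, tuple(secondary)))
    return out


# Hypothetical sentence with one secondary edge from token 3 to token 1.
sentence = [
    Tok(1, "he", 2, "sub"),
    Tok(2, "wants", None, "pred"),
    Tok(3, "sleep", 2, "xobj", ((1, "xsub"),)),
]

plain_tree = encode(sentence)
print([t.relation for t in plain_tree])
print(decode(plain_tree) == sentence)
```

An encoding like this enlarges the label set, since each combination of label and offset becomes a separate class for the parser to learn. Whether that cost is acceptable compared with extending a parser directly is the kind of question the project would need to answer empirically.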

This MA project combines theory (forming an understanding of the challenges of the syntactic format for statistical parsing), literature study (surveying the field for existing/adaptable solutions), implementation, and empirical research (evaluation of the final system(s)).

A working solution of decent quality is a publishable result and will be of practical interest to creators of new PROIEL treebanks, as a parser could be used to support future manual annotation efforts.

Programming skills and an NLP background are prerequisites, as is some knowledge of statistical methods. Affinity with the linguistic side of processing is a plus, as it will, for instance, allow you to do more insightful error analysis.

The project would be supervised by Gerlof Bouma, Yvonne Adesam or possibly others at Språkbanken.


Page updated: 2015-11-24 15:17
