Extra seminar: Christian Chiarcos – A massively parallel diachronic corpus of the Germanic languages: Creation and initial experiments


This talk describes the creation of and initial experiments with a massively parallel corpus of diachronic Germanic.

The majority of parallel and quasi-parallel data on older Germanic languages consists of texts directly or indirectly based on the Bible. This includes actual translations, but also loose paraphrases, in prose or in verse, either as independent works (psalters, gospel harmonies, but also free adaptations in medieval romance), or as part of derived works (such as exegetic commentaries, sermons or chronicles). For several historical languages, most notably Old Saxon and Old High German, Biblical text represents the majority of the parallel data available at all, and gospel harmonies even represent the majority of the data currently known.

Still today, the Bible is the single most translated book in the world, available not only in a vast majority of the world's languages, but also in a number of dialects. The Lord's Prayer and the Tale of the Prodigal Son have been the basis for early studies on dialectology, and with the rise of the internet, home-grown dialectal translations of Bible excerpts, books or the full Old and New Testament have been developed and are circulating in digital form.

This amount of parallel data is of crucial interest to philologists and comparative linguists. Against this background, aligned Bible corpora with morphosyntactic annotation for Old Saxon and Old High German have been developed at the Goethe University Frankfurt in the context of the project "Old German Reference Corpus" (2010-2014) and the LOEWE cluster "Digital Humanities" (2011-2014). They complement the series of annotated Bibles currently available for Gothic, Middle English, and Middle Icelandic.

A massively parallel diachronic corpus of the Germanic languages is, however, not only a valuable resource for historical linguistics, but also relevant to current research in Natural Language Processing. The Germanic languages, with their great body of diachronic material and their well-understood grammatical, morphological and phonological development, provide us with a test bed to study the impact of diachronic relatedness on algorithms for historical-to-modern normalization, annotation projection or model transfer between related language stages. With this data, we can investigate, for example, the correlation between diachronic relatedness and the preferred method to derive NLP tools for less-resourced languages from tools for better-resourced languages.

Accordingly, the annotated Bibles mentioned above have been aligned with each other by the Applied Computational Linguistics Lab of the University Frankfurt, and augmented with a massive corpus of unannotated Bibles (Fig.1).

In this talk, I present the parallel corpus as a resource. I will discuss coverage, availability, data quality and legal issues, as well as initial results on
(i) usability for philological research,
(ii) alignment and annotation projection, and
(iii) normalization and hyperlemmatization.

Christian Chiarcos
Frankfurt am Main, Germany

Date: 2015-02-11 13:15 - 15:00

Location: L308, Lennart Torstenssonsgatan 8



Page updated: 2015-02-09 10:36
