Developing an algorithm for automatic collection of (Swedish) texts on a specific topic from the internet (as a part of Korp and/or Lärka)
Lärka, an application currently under development, is used for computer-assisted language learning. It generates a number of exercises based on corpora (and their annotations) available through Korp. The topic of the source texts, however, is not known. To be able to select authentic contexts on a relevant theme (as described in the CEFR, the Common European Framework of Reference for Languages), we need an automated approach to selecting texts on a given theme, with all the subsequent annotations.
The aims of this work include the following:
implement a Python-based program (eventually web service(s)) for automatic selection of texts from the web, e.g. using so-called “seed words” (web-crawling approach), and handle the problems this is likely to raise: language identification, duplicates, noise, etc.
test and evaluate the program's performance by creating a domain corpus for Swedish, taking the CEFR themes as the basis for sub-corpora.
(potentially) compare the performance of this program with WebBootCaT/Corpus Factory (via Sketch Engine).
(potentially) deploy the web service in Lärka, i.e. implement the necessary user interface “module”.
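As a rough illustration of the first aim, the sketch below shows one way the seed-word step and the clean-up step could fit together: seed words are combined into random tuples that serve as search queries, and retrieved texts are then filtered for length (noise), language, and exact duplicates. This is only a minimal sketch under stated assumptions, not the project's design. All function names, the stopword list, and the thresholds are hypothetical, the stopword-ratio check merely stands in for a proper language identifier, and the actual search and download steps are left out.

```python
import hashlib
import itertools
import random
import re

# Hypothetical, very small list of Swedish function words (assumption);
# a real system would use a proper language identifier instead.
SWEDISH_STOPWORDS = {
    "och", "att", "det", "som", "en", "är", "på", "för",
    "med", "inte", "jag", "av", "till", "den", "har",
}


def build_queries(seed_words, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into random tuples; each tuple is one search query."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    combos = list(itertools.combinations(sorted(set(seed_words)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def looks_swedish(text, threshold=0.15):
    """Crude language check: share of tokens that are Swedish function words."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in SWEDISH_STOPWORDS)
    return hits / len(tokens) >= threshold


def fingerprint(text):
    """Hash of the normalized token sequence, used to spot exact duplicates."""
    normalized = " ".join(tokenize(text))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_texts(texts, min_tokens=5):
    """Drop very short texts (noise), non-Swedish texts, and duplicates."""
    seen = set()
    kept = []
    for text in texts:
        if len(tokenize(text)) < min_tokens:
            continue
        if not looks_swedish(text):
            continue
        fp = fingerprint(text)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(text)
    return kept
```

The fingerprint only catches texts that are identical after lowercasing and punctuation removal. Near-duplicates, such as the same article with different page boilerplate, would need a fuzzier comparison, for example over word n-grams.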
Elena Volodina/Sofie Johansson Kokkinakis
possibly others from Språkbanken