
Automatic selection of (thematic) texts from web


Developing an algorithm for the automatic collection of (Swedish) texts on a specific topic from the internet (as a part of Korp and/or Lärka)


Lärka, an application currently under development, is used for computer-assisted language learning. It generates a number of exercises based on the corpora (and their annotations) available through Korp. The topic of the source texts, however, is not known. To select authentic contexts on a relevant theme (as described in the CEFR document, the Common European Framework of Reference for Languages), we need an automated approach to selecting texts on a given theme, followed by all the subsequent annotation steps.

Problem description

The aims of this work include the following:

  1. to implement a Python-based program (eventually one or more web services) for the automatic selection of texts from the web, e.g. using so-called “seed words” (a web-crawling approach), and to address the likely problems with language identification, duplicates, noise, etc.
  2. to test and evaluate the program's performance by creating a Swedish domain corpus, taking the CEFR (Common European Framework of Reference for Languages) themes as the basis for sub-corpora.
  3. (potentially) to compare the performance of this program with WebBootCat/Corpus Factory (via Sketch Engine).
  4. (potentially) to deploy the web service in Lärka, i.e. to implement the necessary user interface “module”.
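
To make the first aim more concrete, here is a minimal sketch of three of the steps it mentions: turning seed words into search queries, a crude language check, and exact-duplicate removal. It is only an illustration under stated assumptions, not the project's design. All function names, the tiny stopword list, and the threshold are hypothetical, and the step of sending the queries to a search engine and downloading pages is left out entirely.

```python
import hashlib
import itertools
import random
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real system would use a fuller list or a dedicated language identifier.
SWEDISH_STOPWORDS = {
    "och", "att", "det", "som", "en", "är", "på", "för",
    "med", "inte", "av", "till", "den", "har", "jag",
}


def seed_queries(seeds, tuple_size=3, max_queries=10, rng=None):
    """Combine seed words into multi-word search queries (seed-word tuples)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def looks_swedish(text, threshold=0.15):
    """Crude language check: share of tokens that are Swedish stopwords."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in SWEDISH_STOPWORDS)
    return hits / len(tokens) >= threshold


def fingerprint(text):
    """Hash of the normalized text; catches exact duplicates only."""
    normalized = " ".join(tokenize(text))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_texts(texts):
    """Keep texts that look Swedish and have not been seen before."""
    seen = set()
    kept = []
    for text in texts:
        if not looks_swedish(text):
            continue
        fp = fingerprint(text)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(text)
    return kept
```

For example, `seed_queries(["resa", "hotell", "flyg", "biljett"])` yields queries of three seed words each, and `filter_texts` would be applied to the downloaded page texts. Near-duplicates and boilerplate noise need more than this (for instance shingling and boilerplate removal), which is part of what the project would have to investigate.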

Recommended skills:

  • Python
  • (potentially) jQuery


  • Elena Volodina/Sofie Johansson Kokkinakis
  • possibly others from Språkbanken

Page updated: 2014-11-12 15:10
