Dealing with spelling variation in Swedish text in order to improve lemmatization, part-of-speech tagging and parsing.


Språkbanken <http://språkbanken.gu.se> annotates its online corpora with a large in-house lexical resource that doubles as a morphological analyzer, together with an off-the-shelf part-of-speech (POS) tagger and dependency parser. These tools expect the texts to use standardized spellings, although the data-driven tools (the POS tagger and the parser) will still process out-of-vocabulary items that the morphological analyzer does not recognize.

Problem description

Many of the texts in Språkbanken nevertheless contain non-standard spellings, either because they represent a pre-standardization language stage (medieval and 17th-century texts) or because they are full of spelling errors and variants, as is often the case with modern blog texts. The task is to develop and implement a (partial) solution for discovering and handling spelling variation in modern texts, for which we already have sufficiently large-scale language analysis tools. Preferably, the solution should be general and extensible to other text types. The work therefore includes a good deal of linguistic analysis of lemmatizer, POS tagger and parser output.
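
As a rough illustration of what "discovering and dealing with" spelling variation could look like, the sketch below maps tokens that are missing from a lexicon to the closest known form by string similarity. It is only one possible starting point, not the method used at Språkbanken: the toy lexicon, the function name normalize and the cutoff value are all assumptions made for the example, and a real solution would draw on the full lexical resource and on context.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon of standard word forms (an assumption for this
# example); a real system would use a full-scale lexical resource.
LEXICON = {"mycket", "också", "någon", "igen", "väldigt"}


def normalize(token, lexicon=LEXICON, cutoff=0.75):
    """Return (form, status) for a token.

    status is "known" if the lowercased token is in the lexicon,
    "normalized" if a sufficiently similar lexicon entry was found,
    and "unknown" otherwise (the token is then returned unchanged).
    """
    lowered = token.lower()
    if lowered in lexicon:
        return lowered, "known"
    # Pick the single most similar lexicon entry above the cutoff, if any.
    # The lexicon is sorted so that the candidate order is deterministic.
    candidates = get_close_matches(lowered, sorted(lexicon), n=1, cutoff=cutoff)
    if candidates:
        return candidates[0], "normalized"
    return token, "unknown"


if __name__ == "__main__":
    for word in ["mycket", "Myket", "oxå", "xyz"]:
        print(word, normalize(word))
```

Tokens reported as "unknown" are the interesting cases for linguistic analysis. Pure string similarity will miss variants that differ a lot on the surface, such as the informal "oxå" for "också", so a real solution would likely need additional evidence, for example phonetic rules or the behaviour of the tagger and parser on the surrounding sentence.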

Recommended skills

  • Good knowledge of Swedish grammatical analysis
  • General familiarity with POS tagging and parsing
  • Good programming skills


Lars Borin and possibly others, Språkbanken

