Swedish multiword-entity extraction


Finding long (>2 words) word/lemma n-grams


Multiword entities have received much attention lately, both in computational linguistics and in general linguistics. There is a long tradition in computational and corpus linguistics of mining multiword entities from text by applying a wide range of collocation measures to pairs of entities (text words, lemmas, syntactic dependencies), contiguous or non-contiguous, in order to find two-word lexical units or terms. Attempts to discover longer units are much rarer in the literature, in part because good collocation measures seem to be lacking for this problem.
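
To make the pairwise approach concrete, the following is a minimal sketch of one common collocation measure, pointwise mutual information (PMI), computed over contiguous word pairs. The choice of PMI, the function name bigram_pmi and the min_count parameter are assumptions made for illustration only; the text above does not single out any particular measure.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=1):
    """Score contiguous word pairs with pointwise mutual information.

    `sentences` is an iterable of token lists. The function name and the
    `min_count` cutoff are hypothetical, chosen only for this sketch.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Pair each token with its right-hand neighbour.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

A measure of this kind is defined for exactly two events, which is one reason it does not carry over directly to units of three or more words.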

Problem description

The aim of this work is to refine a purely frequency-based way of finding contiguous word n-grams in annotated text, for instance by applying methods from work on automatic word segmentation. The preferred target language is Swedish. English is also acceptable, but in that case Språkbanken can provide only limited support with respect to annotation tools and linguistic expertise.
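
As a rough illustration of what a purely frequency-based baseline might look like, the sketch below counts all contiguous n-grams longer than two tokens and keeps those above a frequency threshold. The function names, the length range (3 to 5) and the min_count threshold are assumptions for the example, not part of the project specification; the input is assumed to be already tokenised (or lemmatised) sentences.

```python
from collections import Counter


def count_ngrams(sentences, n_min=3, n_max=5):
    """Count contiguous n-grams of length n_min..n_max within each sentence.

    `sentences` is an iterable of token (or lemma) lists. N-grams never
    cross sentence boundaries. All names and defaults are hypothetical.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_ngrams(sentences, n_min=3, n_max=5, min_count=2):
    """Keep only the n-grams that occur at least `min_count` times."""
    counts = count_ngrams(sentences, n_min, n_max)
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy usage with two tokenised Swedish sentences.
corpus = [
    ["till", "och", "med", "i", "dag"],
    ["till", "och", "med", "nu"],
]
candidates = frequent_ngrams(corpus)
```

Raw frequency alone tends to favour sequences of very common words, so a list of this kind would be a starting point for the refinement described above rather than a finished result.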

Recommended skills

  • Good knowledge of linguistic analysis
  • General familiarity with POS tagging and parsing
  • Familiarity with machine learning
  • Good programming skills


Lars Borin and possibly others, Språkbanken

