Clustering corpus paragraphs for lexical differentiation


Developing and evaluating a system for clustering corpus paragraphs in order to differentiate word usages in the corpus


Determining the range of usages for a particular word in a corpus is a major challenge. Particular aspects of this problem are investigated under the headings of word sense disambiguation and word sense induction. At Språkbanken <http://språkbanken.gu.se>, the focus is on developing language-aware tools to aid in building lexical resources, such as the Swedish FrameNet and Swesaurus (a Swedish wordnet).

Problem description

The paragraph is the smallest content unit of a text. The project aims at classifying or clustering the paragraphs of a corpus in such a way that occurrences of the same lemma in paragraphs of the same class (or cluster) are likely to reflect the same sense of the word. This will allow us to design a corpus search interface in which such hits are collapsed by default and potentially different senses can be highlighted.
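
As a rough illustration of the idea, the following Python sketch clusters a handful of paragraphs by bag-of-words cosine similarity and then groups the hits for a search word by cluster. It is a minimal sketch under stated assumptions, not the method the project prescribes: the function names, the tiny stopword list, the 0.3 threshold and the single-link grouping are all illustrative choices, and lowercased word forms stand in for proper lemmas.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stopword list (an assumption); a real system would use a
# proper resource for the target language.
STOPWORDS = {"the", "a", "an", "and", "on", "for", "with",
             "over", "after", "she", "they"}


def tokenize(paragraph):
    """Lowercased word forms minus stopwords; a crude stand-in for lemmatization."""
    return [t for t in re.findall(r"\w+", paragraph.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_paragraphs(paragraphs, threshold=0.3):
    """Single-link clustering: paragraphs whose similarity reaches the
    threshold end up in the same cluster. Returns one label per paragraph."""
    vectors = [Counter(tokenize(p)) for p in paragraphs]
    parent = list(range(len(paragraphs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(paragraphs))]


def collapse_hits(paragraphs, labels, word):
    """Group the indices of paragraphs containing `word` by cluster label."""
    groups = defaultdict(list)
    for idx, (paragraph, label) in enumerate(zip(paragraphs, labels)):
        if word in tokenize(paragraph):
            groups[label].append(idx)
    return list(groups.values())
```

Given two paragraphs about a bank as a financial institution and two about a river bank, a search for "bank" would be expected to return two groups of hits, one per cluster, which the search interface could then show collapsed. Note that the search word itself contributes to the similarity between paragraphs; whether to exclude it is one of the design questions such a system would have to settle.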

The work should preferably be carried out on the Swedish SUC corpus, but a suitable English corpus could be used instead, e.g., in the framework of NLTK. Many relevant tools are available in Java; hence, familiarity with Java is necessary.

Recommended skills

  • Fair linguistic analysis skills in the target language
  • Good programming skills, including familiarity with Java


Lars Borin and possibly others, Språkbanken

Page updated: 2012-11-26 23:39

