Text categorization by topics (2016)

Goal

To test and compare approaches to text categorization and topic modeling, using coursebook texts labeled by topic.

Background

A topic model is a statistical model for discovering the abstract "topics" that occur in a collection of documents. This project tests approaches to topic modeling in order to identify the best-performing one, which could eventually be used to select texts for learners according to their preferred topics. The resulting models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

We have recently compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, in which each text is labeled with a topic (or a set of topics). This corpus will provide the training and testing data for the topic modeling experiments.


Problem description

The aims of this work include the following:

  • to study the literature on topic modeling
  • to test and compare several of the suggested methods for text categorization/topic modeling on some or all of the topics present in the COCTAILL corpus (28 topics in total, used at 5 proficiency levels)
  • to apply the developed algorithms to real-life texts (e.g. from Korp or from the web) and assess their performance.
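
As an illustration of the kind of baseline such a comparison might start from, the sketch below trains a small multinomial Naive Bayes classifier on topic-labeled texts and scores it with macro-averaged F1. It is only a minimal sketch under stated assumptions: the class and function names, the toy English sentences and the two topic labels are all invented for the example and are not taken from COCTAILL, and each text is assumed to carry a single topic, whereas COCTAILL texts may carry several.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a placeholder for a real Swedish tokenizer)."""
    return text.lower().split()


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one smoothing (hypothetical baseline)."""

    def fit(self, texts, topics):
        self.topic_counts = Counter(topics)        # number of documents per topic
        self.word_counts = defaultdict(Counter)    # word frequencies per topic
        for text, topic in zip(texts, topics):
            self.word_counts[topic].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.topic_counts.values())
        best_topic, best_score = None, -math.inf
        for topic, n in self.topic_counts.items():
            counts = self.word_counts[topic]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n / n_docs)           # log prior of the topic
            for word in tokenize(text):
                if word in self.vocab:             # skip words unseen in training
                    score += math.log((counts[word] + 1) / total)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


def macro_f1(gold, predicted):
    """Unweighted mean of the per-topic F1 scores over the gold topics."""
    scores = []
    for topic in set(gold):
        tp = sum(g == topic and p == topic for g, p in zip(gold, predicted))
        fp = sum(g != topic and p == topic for g, p in zip(gold, predicted))
        fn = sum(g == topic and p != topic for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data invented for illustration; not drawn from COCTAILL.
train_texts = [
    "we eat bread and cheese for breakfast",
    "the soup and the bread taste good",
    "the train leaves the station at noon",
    "we travel by train and by bus",
]
train_topics = ["food", "food", "travel", "travel"]
test_texts = ["bread and soup for lunch", "the bus leaves the station"]
test_topics = ["food", "travel"]

model = NaiveBayesTopicClassifier().fit(train_texts, train_topics)
predicted = [model.predict(t) for t in test_texts]
score = macro_f1(test_topics, predicted)
```

Macro-averaging is used here because it weights every topic equally, which may matter if some of the 28 topics are much rarer than others. Other approaches could be swapped in behind the same fit/predict interface and compared on the same held-out texts.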


Recommended skills:

  • Python (possibly R)


Supervisor(s)

  • Rickard Johansson/Elena Volodina
  • potentially others from Språkbanken

Page updated: 2016-11-15 11:10
