Text categorization by topics (2016)

Goal

To test and compare approaches to text categorization/topic modeling, using coursebook texts labeled with topics.

Background

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing topic modeling approaches in this project is to identify the best-performing one, which can eventually be used to select texts for learners according to their preferred topics. These models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

Recently, we have compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, where each text is labeled with a topic (or a set of topics). This corpus will form the training/testing data for topic modeling experiments.


Problem description

The aims of this work include the following:

  • to study literature on topic modeling
  • to test and compare several of the suggested approaches to text categorization/topic modeling on (possibly a subset of) the topics present in the COCTAILL corpus (28 topics in total, used at 5 proficiency levels)
  • to apply the developed algorithms to some real-life texts (e.g. from Korp or from the web) to assess their performance.
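
As a rough illustration of what "test/compare" could look like in practice, the sketch below trains a small multinomial Naive Bayes classifier and a majority-class baseline on topic-labeled texts and compares their accuracy. It is only a minimal sketch: the example sentences, the topic labels "food" and "travel", and all class and function names are hypothetical and are not taken from COCTAILL, and Naive Bayes is just one of many possible approaches.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, topics):
        self.topic_counts = Counter(topics)      # documents per topic
        self.word_counts = defaultdict(Counter)  # word frequencies per topic
        self.vocabulary = set()
        for text, topic in zip(texts, topics):
            tokens = tokenize(text)
            self.word_counts[topic].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.topic_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic in sorted(self.topic_counts):
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log P(topic) + sum of log P(word | topic)
            score = math.log(self.topic_counts[topic] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


class MajorityBaseline:
    """Always predicts the most frequent training topic."""

    def fit(self, texts, topics):
        self.most_common = Counter(topics).most_common(1)[0][0]
        return self

    def predict(self, text):
        return self.most_common


def accuracy(model, texts, topics):
    """Fraction of texts whose predicted topic equals the gold topic."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, topics))
    return correct / len(texts)


# Hypothetical toy data standing in for topic-labeled coursebook texts.
TRAIN_TEXTS = [
    "we cook soup and bake bread in the kitchen",
    "she eats bread and cheese for breakfast",
    "he drinks coffee and eats cake",
    "the train leaves the station at noon",
    "we fly to the airport and take a bus",
]
TRAIN_TOPICS = ["food", "food", "food", "travel", "travel"]
TEST_TEXTS = ["they bake cake and cook soup", "the bus leaves the airport"]
TEST_TOPICS = ["food", "travel"]

if __name__ == "__main__":
    for model in (NaiveBayesTopicClassifier(), MajorityBaseline()):
        model.fit(TRAIN_TEXTS, TRAIN_TOPICS)
        score = accuracy(model, TEST_TEXTS, TEST_TOPICS)
        print(type(model).__name__, score)
```

With real data, the same `fit`/`predict`/`accuracy` interface would let further approaches be added to the comparison, and texts with a set of topics would need a multi-label variant rather than the single-label setup assumed here.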


Recommended skills:

  • Python (possibly R)


Supervisor(s)

  • Rickard Johansson/Elena Volodina
  • potentially others from Språkbanken
Page updated: 2016-11-15 11:10
