Testing and comparing approaches to text categorization and topic modeling, using coursebook texts labeled by topic.
A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing topic modeling approaches in this project is to identify the best-performing one, which can then be used to select texts for learners according to their preferred topics. The resulting models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.
We have recently compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, in which each text is labeled with one or more topics. This corpus will serve as the training and testing data for the topic modeling experiments.
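To illustrate how topic-labeled texts can serve as training and testing data, here is a minimal sketch of one possible supervised baseline: a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is not the method chosen for this project. The toy sentences, the topic labels and the function names (tokenize, train_nb, predict, accuracy) are all hypothetical and merely stand in for COCTAILL texts, which are in Swedish and may carry more than one topic each.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (text, topic) pairs standing in for labeled coursebook texts.
TRAIN = [
    ("the family eats dinner at home", "family"),
    ("my mother and father live in a house", "family"),
    ("the train leaves the station at noon", "travel"),
    ("we booked a hotel and a flight", "travel"),
]
TEST = [
    ("my father eats at home", "family"),
    ("the flight leaves at noon", "travel"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_nb(examples):
    """Count documents per topic and word occurrences per topic."""
    topic_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in examples:
        topic_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocab.add(word)
    return topic_counts, word_counts, vocab


def predict(model, text):
    """Return the topic with the highest log-probability for the text."""
    topic_counts, word_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, n_docs in topic_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[topic].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[topic][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def accuracy(model, examples):
    """Fraction of examples whose predicted topic matches the label."""
    correct = sum(predict(model, text) == topic for text, topic in examples)
    return correct / len(examples)


model = train_nb(TRAIN)
print("Held-out accuracy:", accuracy(model, TEST))
```

Scoring on held-out texts, as in accuracy above, is what makes different approaches comparable. An unsupervised topic model would need an extra step that maps the topics it discovers onto the corpus labels before it could be scored the same way.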
The aims of this work include the following: