research_text_lab

Classification of learner essays by achieved proficiency level

Goal

Developing an algorithm (web service(s)) for the automatic classification of Swedish learner essays by their achieved proficiency level.


Background

The suggested approach is to use machine learning for essay classification. The challenge is to identify features that are both grounded in Second Language Acquisition (SLA) research and informative for the task at hand.

The classification will be made in terms of the levels of proficiency according to the Common European Framework of Reference (CEFR), which covers 6 learner levels: A1 (beginner), A2, B1, B2, C1, C2 (near-native). At the moment we have electronic corpora of essays at levels B1, B2, and C1. Essays at A2 are hand-written and have not yet been digitized and annotated (which presumably can be done in time for the project, if someone picks this topic).
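
As a rough illustration of the classification step, the sketch below trains a baseline classifier on labelled essays with scikit-learn. The corpus loading is stubbed out with placeholders, and the character n-gram features are only a stand-in for the SLA-informed features the project is meant to design.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholders standing in for digitized essays and their CEFR labels.
    essays = ["...essay one...", "...essay two...", "...essay three..."]
    levels = ["B1", "B2", "C1"]

    # Character n-grams are robust to learner misspellings; they are only a
    # stand-in for the SLA-informed features the project should design.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(essays, levels)
    print(model.predict(["...an unseen essay..."]))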


Problem description

The steps for this project would include:

  • background reading on the topic of SLA, CEFR, essay grading and learner essay classification by levels. See one example for Swedish essay grading (NOT in terms of levels, but in terms of grades, i.e. (Väl/Icke) Godkänd): http://www.ling.su.se/english/nlp/tools/automated-essay-scoring
  • testing approaches for the best-performing classification
  • implementation of web service(s) for learner essay classification
  • (potentially) implementation of Lärka-based user interface where new essays can be tested
  • (potentially) evaluation of the results with teachers & new essays


Recommended skills:

  • Python
  • jQuery
  • interest in machine learning


Supervisor(s)

  • Elena Volodina/Ildiko Pilan
  • potentially others from Språkbanken/FLOV

Overcoming semantic challenges in selection of distractors for multiple-choice vocabulary exercises (2016)

Goal

Find a way to make sure that distractors in multiple-choice activities are genuine (i.e. cannot be used instead of the correct answer) in the context of a sentence/exercise item. This is primarily aimed at Swedish, but other languages are possible candidates as well.


Background

The multiple-choice item is a well-documented exercise format for training vocabulary knowledge. However, when it comes to generating this exercise type automatically, selecting genuinely appropriate distractors becomes a complicated problem. For example, if a learner wants to practice vocabulary from the topical domain of “Medical services and SOS”, answer options from the same topical domain might be generated as follows:


Parents couldn't afford to buy the necessary _________.

Choices: pincers, medicine, tablets, blood, hospital, nurse (correct answer: medicine)


More than one alternative from the example above can be used to fill the gap (pincers, medicine, tablets). However, it is important to be able to select distractors that cannot replace the correct answer, semantically or collocationally, in the context of the sentence, e.g. in the case above to suggest the choices: medicine, blood, hospital, nurse, emergency room
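
One conceivable filter, sketched below, scores each candidate's fit to the sentence context with word embeddings and discards candidates that fit almost as well as the correct answer. The embedding file name is an assumption (any gensim-compatible Swedish model would do), and the 0.8 margin is arbitrary; an n-gram language model scoring the full sentence would be a natural alternative.

    from gensim.models import KeyedVectors

    # Hypothetical file name; any gensim-compatible Swedish model would do.
    vectors = KeyedVectors.load_word2vec_format("sv_vectors.bin", binary=True)

    def context_fit(word, context_words):
        """Mean cosine similarity between a word and the stem's content words."""
        if word not in vectors:
            return 0.0
        sims = [vectors.similarity(word, c) for c in context_words if c in vectors]
        return sum(sims) / len(sims) if sims else 0.0

    context = ["parents", "afford", "buy", "necessary"]   # content words of the stem
    key = "medicine"
    candidates = ["pincers", "tablets", "blood", "hospital", "nurse"]

    key_fit = context_fit(key, context)
    # Keep only candidates that fit the context clearly worse than the key;
    # the 0.8 margin is arbitrary and would need tuning.
    distractors = [c for c in candidates if context_fit(c, context) < 0.8 * key_fit]
    print(distractors)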


Problem description

The aim of this work is thus:

  • to study the literature on the topic of distractor selection, lexical semantics and context modeling
  • implement/test some approach(es) for a semantically aware selection of distractors
  • embed the selection algorithm into Lärka as a web service
  • evaluate/test on users (language learners, teachers, linguists, etc.)


Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisor(s)

  • Elena Volodina, Ildiko Pilan
  • potentially others from Språkbanken/FLOV

Master thesis project: OCR error correction and segmentation

Goal: Creating better quality text by finding good methods for OCR error correction and text segmentation.

Background: Long-term textual archives, spanning decades or centuries, have the potential to answer many interesting research questions: one can look for changes in language and culture, author influence, and writing standards, among other things. One difficulty in working with historical documents is the quality of the data. Many long-term archives contain documents scanned at a time when OCR technology was far from perfect, and the physical quality of the documents themselves often leaves much to be desired. Because redoing the OCR manually would take a vast amount of time, the errors should instead be corrected using rule-based systems and machine learning techniques. Beyond the OCR problems, many co-occurrence statistics require a definition of sentence or paragraph, which can in itself be very challenging for texts from certain periods.

Are you interested in working with a large digitized textual archive and developing techniques for correcting OCR errors in documents over 200 years old? Do you want to find ways to define sentences and paragraphs? Do you want to be part of a research team working on an exciting new research topic?

Problem description: Applying rule-based and statistical machine learning techniques to improve the quality of a large newspaper archive. The improvements will later be used by Språkbanken and made available to the archive's users.
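
To give a flavour of the rule-based side, the toy sketch below rewrites tokens with typical OCR confusion patterns (e.g. “rn” misread as “m”, long s misread as “f”) and accepts a rewrite if it lands in a modern lexicon. The confusion pairs and word list are illustrative stand-ins; a real system would rank candidates with corpus frequencies or a character-level error model.

    # Confusion pairs: (as seen in OCR output, likely original). Illustrative only.
    CONFUSIONS = [("m", "rn"), ("rn", "m"), ("f", "s")]
    LEXICON = {"modern", "svensk", "tidning"}    # stand-in for a full word list

    def candidates(token):
        """Apply each confusion rule at every position where it matches."""
        out = set()
        for seen, meant in CONFUSIONS:
            start = token.find(seen)
            while start != -1:
                out.add(token[:start] + meant + token[start + len(seen):])
                start = token.find(seen, start + 1)
        return out

    def correct(token):
        if token in LEXICON:                     # already a known word
            return token
        hits = [c for c in candidates(token) if c in LEXICON]
        # A real system would rank hits by corpus frequency; we take any match.
        return hits[0] if hits else token

    print(correct("modem"))                      # "rn" collapsed to "m" -> "modern"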

Recommended skills:

  • Interest in rule-based systems and statistical machine learning techniques
  • Some mathematical background
  • Programming skills
  • A highly motivated 5th-year student

Supervisors: If you are interested or have questions, please contact: Nina Tahmasebi, phone: 031-786 6953, email: nina.tahmasebi@gu.se

Text categorization by topics (2016)

Goal

Testing/comparing approaches to text categorization/topic modeling based on coursebook texts labeled for topics.

Background

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing topic-modeling approaches in this project is to identify the best-performing approach, which can eventually be used to select texts for learners by their topic of preference. These models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

Recently, we have compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, where each text is labeled with a topic (or a set of topics). This corpus will form the training/testing data for topic modeling experiments.
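
Since the COCTAILL texts come with topic labels, the task can also be framed as supervised categorization rather than unsupervised topic modeling. A minimal scikit-learn sketch follows, with corpus loading stubbed out; multi-topic texts would additionally need a multi-label setup not shown here.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Placeholders standing in for COCTAILL texts and their topic labels.
    texts = ["...coursebook text about travel...", "...coursebook text about food..."]
    topics = ["travel", "food"]

    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(texts, topics)
    print(classifier.predict(["...a new text from Korp or the web..."]))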


Problem description

The aims of this work include the following:

  • to study literature on topic modeling
  • to test/compare several suggested approaches to text categorization/topic modeling for (some of?) the topics present in the COCTAILL corpus (28 topics in total, used across 5 proficiency levels)
  • to apply the developed algorithms to some real-life texts (e.g. from Korp or from the web) to assess their performance


Recommended skills:

  • Python, (maybe R)


Supervisor(s)

  • Richard Johansson/Elena Volodina
  • potentially others from Språkbanken

Part-of-speech tagging/syntactic parsing of emergent texts

Goal

The goal of this project is to implement a part-of-speech tagger for emergent text, i.e. texts – or representations of texts – that are still being produced (and thus frequently change), and to investigate the possibilities of developing a syntactic parser for such text, in order to identify the syntactic location of, for example, pauses.

Background


In research on language production, pauses and revisions are generally viewed as a window onto the underlying cognitive and linguistic processes. In research on written language production, these processes are captured by means of keystroke-logging programs that record all keystrokes and mouse movements and their temporal distributions. These programs generate vast amounts of data, which are time-consuming to analyse manually. Thus a part-of-speech tagger that could handle emergent text would be of great value for quantitative analyses of large language-production corpora. Naturally, a syntactic parser would add even more value.

Problem description

To develop an HMM tagger for emergent texts (primarily in Swedish, but English texts could also be made available).

To investigate the possibilities of implementing a discourse-based incremental parser for emergent texts and, if possible, to implement it.
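
As a starting point for the first task, the toy sketch below trains an HMM tagger with NLTK's built-in trainer. The tagged toy sentences stand in for real training data, which would come from keystroke-log snapshots of emergent Swedish text.

    from nltk.tag import hmm

    # Toy tagged sentences; real data would be keystroke-log snapshots of
    # emergent Swedish text with gold-standard tags.
    train_data = [
        [("jag", "PN"), ("skriver", "VB"), ("en", "DT"), ("text", "NN")],
        [("texten", "NN"), ("växer", "VB"), ("fram", "AB")],
    ]

    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)
    # An emergent fragment may stop mid-sentence; the tagger still runs.
    print(tagger.tag(["jag", "skriver"]))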

Recommended skills:

Good programming skills.

Supervisors:

Richard Johansson and Åsa Wengelin

Spelling games

Goal

Developing an algorithm (web service(s)) for the automatic generation of exercises for training spelling (primarily for Swedish)

Background

The currently developed application Lärka, with its web services, is used for computer-assisted language learning. Lärka generates a number of exercises based on the corpora (and their annotation) available through Korp. Vocabulary knowledge covers a broad spectrum of word knowledge; spelling and recognition in speech are two aspects of it.

Problem description

The aims of this work are:

  1. to implement web service(s) for adaptive spelling exercise generation using a text-to-speech module for Swedish, where the target words/phrases are pronounced and the student has to type what they hear. If the user seems comfortable with the word spellings, the target words get longer, get inflected, or are pronounced in phrases, sentences, etc. (the adaptivity logic is sketched after this list)
  2. analyze possible approaches to provide reasonable feedback
  3. implement user interface for the exercise to be used in Lärka
  4. create a database for storing all possible misspellings associated with each individual graphical word for future analysis (for better feedback or better adaptivity path)
  5. (potentially) evaluate the results.
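
A possible shape for the adaptivity logic in step 1 is sketched below. The difficulty ladder and the stubbed text-to-speech call are assumptions, since the actual Swedish TTS module would be a separate web service.

    import random

    LEVELS = [
        ["hus", "bil", "sol"],                   # short base forms
        ["husen", "bilarna", "solarna"],         # inflected forms
        ["ett gult hus", "den nya bilen"],       # short phrases
    ]

    def play_audio(text):
        """Placeholder for a call to the Swedish TTS web service."""
        print(f"(pronouncing: {text})")

    def run_session(answers):
        """Step through one session; `answers` simulates the learner's typing."""
        level = 0
        for answer in answers:
            target = random.choice(LEVELS[level])
            play_audio(target)
            if answer(target) == target:         # typed correctly: harder items
                level = min(level + 1, len(LEVELS) - 1)
            else:                                # mistake: easier items
                level = max(level - 1, 0)
        return level

    # Simulate a learner who always types the target correctly.
    print("final level:", run_session([lambda t: t] * 5))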

Recommended skills:

  • Python
  • jQuery

Supervisor(s)

  • Elena Volodina/Torbjörn Lager
  • possibly others from Språkbanken/FLOV

Exercise generator for English or any other language available through NLTK

Goal

Developing Python-based programs (web service(s)) for the automatic generation of exercises (e.g. of the same type as in Lärka) for languages other than Swedish, using corpora and the necessary language resources/tools available through NLTK

Background

The currently developed application Lärka, with its web services, is used for computer-assisted language learning. Lärka generates a number of exercises based on the corpora (and their annotation) available through Korp. To target languages other than Swedish, we need access to annotated corpora and relevant resources/tools for those languages, which are potentially available through NLTK.
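
As a taste of what the NLTK route makes possible, the sketch below builds a simple gap-fill item from the tagged Brown corpus, drawing POS-matched distractors from the same corpus. Distractor quality and level adaptation, questions raised in the problem description below, are deliberately left open.

    import random
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)            # fetch the corpus if missing
    tagged = brown.tagged_sents()[:500]           # a small slice for the demo

    # Candidate distractors: all singular nouns seen in the slice.
    nouns = sorted({w.lower() for sent in tagged for w, t in sent if t == "NN"})

    sentence = next(s for s in tagged if any(t == "NN" for _, t in s))
    target = next(w for w, t in sentence if t == "NN")

    # Gap the target (all occurrences, which is fine for a sketch).
    stem = " ".join("_____" if w == target else w for w, _ in sentence)
    distractors = random.sample([n for n in nouns if n != target.lower()], 3)

    print(stem)
    print("choices:", [target] + distractors)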

Problem description

The aims of this work are therefore:

  1. to implement web service(s) for exercise generation for English or any other language using NLTK, primarily of the same types as offered by Lärka (to avoid the hassle of implementing a user interface), or possibly other exercise types.
  2. depending on the type of exercises you choose to implement, a number of questions might arise, e.g. how to adapt exercises to relevant learner levels, how to assign texts to appropriate language proficiency levels, and how to select distractors.
  3. (potentially) if other exercise types are chosen, the necessary user interface modules will need to be implemented.
  4. (potentially) evaluate the results.

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Markus Forsberg
  • possibly others from Språkbanken

Automatic text classification by its readability

Goal

Developing an algorithm for the automatic assignment of texts to relevant language learner levels (to be used in Lärka and possibly also in Korp)

Background

Text readability measures assign readability scores to texts according to certain features, like sentence and word length. These are not enough to fully estimate a text's appropriateness for language learners, or for other user groups with limited abilities in a language. Recent PhD research at Språkbanken (Katarina Heimann Mühlenbock) has concentrated on studying different aspects of text with regard to readability. However, no implementation has been released.
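
For concreteness, the sketch below computes one classic surface measure of the kind mentioned above, LIX (läsbarhetsindex), which combines sentence length with the share of long words. The project would treat such scores as baseline features rather than as the final model.

    import re

    def lix(text):
        """LIX = words/sentences + 100 * (words longer than 6 chars) / words."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"\w+", text)
        long_words = [w for w in words if len(w) > 6]
        return len(words) / len(sentences) + 100 * len(long_words) / len(words)

    print(lix("Det här är en enkel mening. Läsbarhetsindexet beräknas så här."))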

Problem description

The aim of this work is thus:

  1. to study the above-mentioned PhD thesis as well as a number of other research papers, and find a feasible implementation approach
  2. implement a program in Python for the automatic categorization of texts into CEFR levels
  3. implement a user interface for working with different text parameters (e.g. for switching them on/off)
  4. evaluate the approach by comparing its classifications against a number of texts of known CEFR levels

Recommended skills:

  • Python
  • jQuery

Supervisor(s)

  • Elena Volodina/Katarina Heimann Mühlenbock
  • possibly others from Språkbanken

Automatic selection of (thematic) texts from web

Goal

Developing an algorithm for the automatic collection of (Swedish) texts on a specific topic from the internet (as a part of Korp and/or Lärka)

Background

The currently developed application Lärka is used for computer-assisted language learning. Lärka generates a number of exercises based on the corpora (and their annotation) available through Korp. The topic of the source texts is, however, not known. To be able to select authentic contexts on a relevant theme (as described in the Common European Framework of Reference, CEFR), we need an automated approach to selecting texts on a given theme, with all the subsequent annotations.

Problem description

The aims of this work include the following:

  1. to implement a Python-based program (possibly web service(s)) for the automatic selection of texts from the web, e.g. using so-called “seed words” (a web-crawling approach, sketched after this list), while addressing the possible problems with language identification, duplicates, noise, etc.
  2. test/evaluate the program's performance by creating a domain corpus for Swedish, taking CEFR themes as the basis for sub-corpora.
  3. (potentially) compare the performance of this program with WebBootCat/Corpus Factory (via SketchEngine).
  4. (potentially) deploy the web service in Lärka, i.e. implement the necessary user interface “module”
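
A rough sketch of the seed-word pipeline from step 1 follows. The search step is stubbed out because it depends on whichever search API the project can use; language filtering relies on the langdetect package, and deduplication is naive exact matching.

    import itertools
    import requests
    from langdetect import detect

    SEEDS = ["sjukhus", "medicin", "recept", "apotek"]   # seeds for one CEFR theme

    def search(query):
        """Placeholder: return candidate URLs from whatever search API is used."""
        return []

    seen, corpus = set(), []
    for combo in itertools.combinations(SEEDS, 3):       # BootCaT-style seed tuples
        for url in search(" ".join(combo)):
            text = requests.get(url, timeout=10).text    # naive: no HTML stripping
            if detect(text) == "sv" and text not in seen:
                seen.add(text)                           # exact-match deduplication
                corpus.append((url, text))

    print(len(corpus), "pages collected")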

Recommended skills:

  • Python
  • (potentially) jQuery

Supervisor(s)

  • Elena Volodina/Sofie Johansson Kokkinakis
  • possibly others from Språkbanken

Medication extraction from "Dirty data"

Aim

Dealing with spelling variation in Swedish medical texts with respect to names of drugs and related information, in order to improve indexing and aggregation. Extraction of medication-related information is an important task within the biomedical area. The updating of drug vocabularies cannot keep pace with the evolution of drug development. Several methods can be used, e.g. combining internal and contextual clues.

The application will primarily be based on "dirty" data (blogs, Twitter, logs), and, if necessary, on scientific "clean" data for comparison.
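
A toy sketch of combining internal and contextual clues is given below. The suffix and trigger-word lists are illustrative only, and the capitalization heuristic will over-generate (e.g. on sentence-initial words).

    import re

    SUFFIXES = ("cillin", "azol", "pril", "statin", "mycin")        # internal clues
    CONTEXT = {"mg", "tablett", "tabletter", "dos", "ordinerades"}  # contextual clues

    def find_drug_mentions(text):
        tokens = re.findall(r"\w+", text)
        hits = []
        for i, tok in enumerate(tokens):
            internal = tok.lower().endswith(SUFFIXES)
            window = {t.lower() for t in tokens[max(0, i - 3):i + 4]}
            contextual = bool(window & CONTEXT)
            # Crude: capitalization + context will over-generate, e.g. on
            # sentence-initial words; a real system needs better internal models.
            if internal or (tok[0].isupper() and contextual):
                hits.append(tok)
        return hits

    print(find_drug_mentions("Patienten ordinerades Kåvepenin 500 mg dagligen."))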

Recommended skills

  • Don't have to be native speaker of Swedish, but some superficial knowledge of Swedish would be good to have.
  • Good programming skills

Supervisor(s)

Dimitrios Kokkinakis and possibly others from Språkbanken

References

Chen E, Hripcsak G, Xu H, Markatou M, and Friedman C. Automated acquisition of disease drug knowledge from biomedical and clinical documents: an initial study. J Am Med Inform Assoc 2008;15(1):87–98.

Chieng D, Day T, Gordon G, and Hicks J. Use of natural language programming to extract medication from unstructured electronic medical records. In: AMIA, 2007:908–8.

Segura-Bedmar I, Martinez P, and Segura-Bedmar M. Drug name recognition and classification in biomedical texts. Drug Safety 2008;13(17-18):816–23.

Sibanda T and Uzuner O. Role of local context in deidentification of ungrammatical, fragmented text. Proceedings of the North American Chapter of the Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006), New York, USA, 2006.
