
research_text_lab

Acquisition, correlation, visualization and use of lexico-syntactic and semantic features from Swedish transcribed interactions (2016)

The purpose of this project is to:

1. conduct a literature review in the area of feature extraction from dialogue data and spoken language transcriptions,
2. implement a (large) set of lexico-syntactic and semantic features from the papers reviewed in the previous step, and
3. build, or reuse, a classifier based on the features extracted in the previous step.

The resulting application should be able to distinguish transcribed spoken dialogues from dialogues in other corpora.
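As an illustration of step 2, the sketch below computes a few surface lexico-syntactic features from one transcribed turn. The feature names and the Swedish filler/pronoun lists are invented for the example, not taken from the project description.

```python
def extract_features(tokens):
    """Return a dict of simple dialogue-oriented features for one transcript."""
    fillers = {"eh", "mm", "hm", "ja", "nej"}   # hypothetical Swedish filler words
    pronouns = {"jag", "du", "vi", "ni"}        # 1st/2nd person pronouns
    n = len(tokens)
    if n == 0:
        return {"ttr": 0.0, "filler_rate": 0.0, "pronoun_rate": 0.0}
    return {
        "ttr": len(set(tokens)) / n,            # type-token ratio
        "filler_rate": sum(t in fillers for t in tokens) / n,
        "pronoun_rate": sum(t in pronouns for t in tokens) / n,
    }

dialogue = "ja eh jag tror du har rätt eh ja".split()
print(extract_features(dialogue))
```

Feature vectors of this kind would then be fed to the classifier of step 3.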

Parsing the PROIEL syntactic formalism for historical treebanks

In the context of the project Pragmatic Resources for Old Indo-European Languages (University of Oslo, 2008-2012), a formalism and guidelines for the annotation of parts of speech, morphology and dependency syntax have been developed for several historical Indo-European languages, such as Ancient Greek, Latin, Old Church Slavonic and Gothic. This formalism and the derived guidelines have by now been used for 18 different languages/language stages. This includes Old Swedish, as part of the project Methods for the automatic Analysis of Text in digital Historic Resources, which currently runs at the Department of Swedish at GU.

Syntactic annotation comes in the form of dependency trees, extended with so-called secondary edges to capture argument sharing and with empty tokens to capture ellipsis of verbs and coordinators. As far as we are aware, no one has attempted to statistically parse the PROIEL format. In this project, you will investigate how to parse this format and construct a working setup, so that anyone with a PROIEL corpus can come and train a parser on their annotated data.

There is a fair amount of training material available to train parsers on, a selection of which can be used in the MA project. In total, the number of annotated tokens with some kind of PROIEL annotation is well over a million, with the largest languages having over 100k tokens of annotated data.

An important part of the project is to investigate how to tackle the fact that the PROIEL syntactic format is not a classic dependency tree -- are there existing statistical parsers that can handle a format like PROIEL's, can you adjust/extend an existing parser, or could you handle the PROIEL format with a standard parser and some pre- and post-processing? You will also evaluate your final solution's performance on some of the existing annotated material. The historical material also has other challenging features for parsing, such as non-standard orthography, a lack of uniform sentence marking and morphologically rich language, but these issues are not intended to be the focus of the project.
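One concrete shape the pre-/post-processing route could take is sketched below: strip the secondary edges so that a standard tree parser sees an ordinary dependency tree, and keep them aside for restoration in a post-processing step. The tuple-based token format here is invented for the sketch, not PROIEL's actual file format.

```python
def split_secondary(tokens):
    """tokens: list of (id, head, deprel, secondary_edges), where
    secondary_edges is a list of (head, deprel) pairs.
    Returns a plain dependency tree plus the removed secondary edges."""
    tree, removed = [], []
    for tid, head, deprel, secondary in tokens:
        tree.append((tid, head, deprel))        # primary edge: always a tree edge
        for shead, srel in secondary:
            removed.append((tid, shead, srel))  # set aside for post-processing
    return tree, removed

sent = [
    (1, 2, "sub", []),
    (2, 0, "pred", []),
    (3, 2, "xobj", [(1, "xsub")]),  # shared argument via a secondary edge
]
tree, removed = split_secondary(sent)
print(tree)
print(removed)
```

The interesting research question is then how (and how well) the removed edges can be predicted back after parsing, e.g. with a separate classifier.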

This MA project combines theory (forming an understanding of the challenges of the syntactic format for statistical parsing), literature study (surveying the field for existing/adaptable solutions), implementation, and empirical research (evaluation of the final system(s)).

A working solution of decent quality is a publishable result and will be of practical interest to creators of new PROIEL treebanks, as a parser could be used to support future manual annotation efforts.

Programming skills and an NLP background are a prerequisite, as is some knowledge of statistical methods. Affinity with the linguistic side of processing is a plus as this will for instance allow you to do more insightful error analysis.

The project would be supervised by Gerlof Bouma, Yvonne Adesam or possibly others at Språkbanken.

Sub-corpus topic modeling and Swedish literature

The goal of the Master thesis will be to:

  • use/process a large Swedish text collection,
  • experiment with and apply topic modeling, and subsequently sub-corpus topic modeling (following the description by Tangherlini & Leonard, 2013), and
  • adapt or create a visual, web-based environment to explore the results. This can be done in various ways, preferably as (a) network graphs (Smith et al., 2014; see for instance figure 1), integrated into (b) a web-based exploratory environment such as a dashboard (see figure 2).
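A minimal sketch of the projection idea behind sub-corpus topic modeling: topics estimated on a small, thematically focused sub-corpus are scored against documents from the full collection, e.g. to produce edge weights for a network graph. The toy topics and document below are invented; a real run would estimate the topics with an LDA implementation such as gensim.

```python
def topic_scores(doc_tokens, topics):
    """Score one document against each topic as the fraction of its
    tokens that belong to that topic's top-word set."""
    n = len(doc_tokens) or 1
    return {name: sum(t in words for t in doc_tokens) / n
            for name, words in topics.items()}

# Toy "topics" standing in for the top words of an estimated topic model.
topics = {"sea": {"hav", "våg", "skepp"}, "city": {"stad", "gata", "torg"}}
doc = "skeppet gled över hav och våg mot en annan stad".split()
print(topic_scores(doc, topics))
```

In a network-graph view, each (document, topic) score above a threshold would become an edge.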


A poor man's approach to integrating POS-tagging and parsing.


In the by now traditional NLP processing setup, part-of-speech tagging and syntactic parsing are separate, ordered tasks. A sentence is first POS-tagged, after which the results are used as the input for parsing. This model is convenient because it allows one to use a more efficient technique for the "simpler" task of POS-tagging, and it helps keep the search space of the expensive parsing task down.

On the downside, however, we note that a POS-tagger is missing out on possibly beneficial syntactic information -- POS-tagging precedes parsing and therefore syntactic information cannot be used to choose between alternative tag sequences. In turn, we can expect parsing to suffer from a resulting decreased accuracy in POS-tagging.

Indeed, in PCFG-based parsing, parsing and POS-tagging have long been one and the same processing step. More recently, in the data-driven dependency parsing literature, algorithms for combined parsing and POS-tagging have been proposed, and they have been shown to lead to improved results.

In this project, you will investigate a simpler, more general approach to integrating POS-tagging and parsing: letting the POS-tagger and the parser entertain multiple hypotheses about the analysis of a sentence, from which the best analysis can then be chosen. This way one can achieve a free flow of information between the two processes -- hopefully improving accuracy -- without having to radically change the NLP setup (POS-tagging still precedes parsing). Existing tools can be used with very little alteration, which makes it a poor man's solution.
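The setup can be sketched in a few lines: the tagger proposes its k-best tag sequences, the parser scores a parse for each, and the analysis with the best combined score wins. Everything below -- the additive score combination, the toy tagger and parser outputs -- is a stand-in chosen to illustrate the idea, not a prescribed design.

```python
def best_analysis(kbest_taggings, parse_score, weight=1.0):
    """kbest_taggings: list of (tag_sequence, tagger_logprob) pairs.
    parse_score: function mapping a tag sequence to (parse, parser_logprob).
    Returns (combined_score, tags, parse) for the jointly best analysis."""
    scored = []
    for tags, tag_lp in kbest_taggings:
        parse, parse_lp = parse_score(tags)
        scored.append((tag_lp + weight * parse_lp, tags, parse))
    return max(scored)

# Toy stand-ins: the tagger slightly prefers a tagging that the parser dislikes.
kbest = [(("NN", "VB"), -1.0), (("VB", "NN"), -1.5)]
parses = {("NN", "VB"): ("tree_a", -4.0), ("VB", "NN"): ("tree_b", -2.0)}
score, tags, parse = best_analysis(kbest, parses.get)
print(tags, parse)  # the second tagging wins once parser evidence is added
```

Design questions the project would have to answer include how large k should be, how to weight the two scores, and whether the parser should also return k-best parses per tagging.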

As part of the project, you will investigate, design and implement different ways of realizing the setup sketched above, and present experiments showing the impact of your choices on analysis accuracy and efficiency.

This MA project combines theoretical aspects (literature study and design of a realization of the system outlined above), implementation, and empirical study (evaluation of the system).

Programming skills and an NLP background are a prerequisite. Knowledge of statistical methods is a big plus, as is affinity with the linguistic side of processing, as this will allow you to do more insightful error analysis. Since the development material will be Swedish text, some passive knowledge of the Swedish language is assumed.

The project would be supervised by Gerlof Bouma and Richard Johansson, Yvonne Adesam or possibly others at Språkbanken.

Building a sentiment lexicon for Swedish

The goal of this project is the semi-automatic construction of a sentiment lexicon for Swedish. For more information see link below.

http://spraakbanken.gu.se/eng/personal/richard/sentiment_dict_project

Adding valency information to a dependency parser

The goal of this project is to improve a Swedish dependency parser by integrating a valency lexicon. For more information see link below.

http://spraakbanken.gu.se/eng/personal/richard/valency_project

Part-of-speech tagging/syntactic parsing of emergent texts

The goal of this project is to implement a part-of-speech tagger and investigate the possibilities of developing a syntactic parser that could handle emergent text, i.e. texts -- or representations of texts -- that are being produced (and thus frequently changed), in order to identify the syntactic location of, for example, pauses. For more information see the link below.

http://www.clt.gu.se/masterthesisproposal/part-speech-taggingsyntactic-p...

Historical Text reuse (in Swedish Literature) (2016)

The goal of the Master thesis will be to apply (implement or adapt) techniques, e.g. borrowed from the field of bioinformatics, to identify lexically similar passages (i.e. phrases, sentences, quotes, paraphrases) across collections of Swedish literary texts. Any suitable algorithm can be used for this purpose, though sequence alignment is preferred, and the results should be presented/visualized in a user-friendly (and navigable) way.
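As an illustration of the bioinformatics angle, the sketch below computes a Smith-Waterman-style local alignment score over tokens rather than nucleotides, which is one standard way to find lexically similar passages. The scoring constants are arbitrary choices for the example.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between token lists a and b
    (Smith-Waterman dynamic programming, score only, no traceback)."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0,                          # restart alignment
                             rows[i - 1][j - 1] + sub,   # match/mismatch
                             rows[i - 1][j] + gap,       # gap in b
                             rows[i][j - 1] + gap)       # gap in a
            best = max(best, rows[i][j])
    return best

x = "det var en gång en liten flicka".split()
y = "en gång var det en liten pojke".split()
print(local_align(x, y))
```

A full system would add traceback to recover the aligned passages, and some indexing scheme to avoid aligning every text pair exhaustively.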

Collocations for learners of Swedish


Goal

Generate a list of collocations, phrasal verbs, set phrases and idioms important for learners of Swedish, linked to proficiency levels, for use in Lärka.


Background

The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp, one of them focusing on vocabulary. It has been mentioned on several occasions that we should include multi-word expressions in our exercise generator. This also complies with the CEFR "can-do" statements at different levels of proficiency (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). It is, however, a non-trivial task to identify the items that should be included in the curriculum, and it is even more uncertain how the selected items can be assigned to different proficiency levels.


Problem description

The aims of this work are the following:

  • to study literature on collocations etc. in general, and in the L2 context especially, paying special attention to the CEFR guidelines; to make an overview of the practices for training collocations etc. used in other applications and in (online) dictionaries/lexicons
  • to generate a list of collocations, (primarily) by automatic analysis of COCTAILL, a corpus of coursebook texts used for teaching Swedish. Studying different materials available outside COCTAILL, e.g. books written by Anna Hallström, or multi-word expressions in Saldo and Lexin, may also prove beneficial; the challenge, however, would be to define at which level these items should be introduced. To get some inspiration, have a look at the English Vocabulary Profile: http://vocabulary.englishprofile.org/staticfiles/about.html (user: englishprofile, password: vocabulary)
  • (potentially) to implement one or more of the suggested exercise formats as web services + user interface in Lärka
  • to evaluate/test on users (language learners, teachers, linguists, etc.)
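One simple way the list-generation step could start is by ranking adjacent word pairs by pointwise mutual information, a standard association measure for collocation extraction. The toy corpus and the frequency threshold below are placeholders; COCTAILL would be the real input.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Return adjacent bigrams scored by PMI, highest first.
    min_count filters out unreliable low-frequency pairs."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    scored = {}
    for (w1, w2), c in bi.items():
        if c >= min_count:
            # PMI = log P(w1,w2) / (P(w1) * P(w2))
            scored[(w1, w2)] = math.log((c / (n - 1)) /
                                        ((uni[w1] / n) * (uni[w2] / n)))
    return sorted(scored.items(), key=lambda kv: -kv[1])

corpus = ("ta reda på om du kan ta reda på svaret "
          "du kan få svar och du kan ta reda på mer").split()
print(pmi_bigrams(corpus)[:3])
```

The hard part flagged in the proposal -- assigning each extracted item to a CEFR level -- is not addressed by the association score itself and would need the coursebook-level information in COCTAILL.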


Recommended skills:

  • Python
  • interest in Lexical Semantics and Second Language Acquisition


Supervisor(s)


  • Elena Volodina
  • potentially others from Språkbanken/FLOV

Developing an adaptive diagnostic vocabulary/grammar test for Swedish (2016)

Developing an adaptive diagnostic vocabulary/grammar test for Swedish

Goal

Implement an adaptive diagnostic test for vocabulary and/or grammar for Swedish, based on Second Language Acquisition (SLA) research and frequency statistics available from the COCTAILL corpus.


Background

The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp. Attempts are being made to align generated exercises with CEFR proficiency scales (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). The actual users, however, may not know their level when they start working with the exercise generator. It is therefore important (and user-friendly) to offer some sort of placement/diagnostic test for those who may need it.


Some examples of existing diagnostic tests for vocabulary are:


Problem description

The aims of this work are the following:

  • to study literature on diagnostic testing for different language skills and competences that are relevant for the CEFR;
  • to find out about other "actors" dealing with CEFR-based tests for Swedish, especially for placement/diagnosis; as a result, to suggest a format for a placement test for one or (better) a range of language skills and competences mentioned in the CEFR
  • to implement the suggested test(s) in the form of web services that can be embedded into the Lärka platform (+ possibly develop the user interface module for that). Here it would be interesting, for example, to explore formats where free answers could be provided and scored
  • evaluate/test on users (language learners, teachers, linguists, etc.)
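One very simple format the adaptive part could take is a staircase procedure: move the next item's difficulty one CEFR step up after a correct answer and one step down after an incorrect one, converging on the learner's level. Real diagnostic tests often use item response theory instead; the simulated learner and the parameters below are stand-ins for illustration.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def run_test(answers_correctly, n_items=9, start=2):
    """Staircase adaptive test. answers_correctly: function taking a
    level index and returning True if the learner answers an item at
    that level correctly. Returns the estimated CEFR level."""
    level = start
    for _ in range(n_items):
        if answers_correctly(level):
            level = min(level + 1, len(LEVELS) - 1)  # harder next item
        else:
            level = max(level - 1, 0)                # easier next item
    return LEVELS[level]

# A simulated learner who handles everything up to B2 but nothing above it.
print(run_test(lambda lvl: lvl <= 3))
```

In a real test, `answers_correctly` would be replaced by presenting a vocabulary or grammar item drawn from the COCTAILL frequency statistics at the given level.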


Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisor(s)

  • Elena Volodina, Ildiko Pilan
  • potentially others from Språkbanken