The purpose of this project is to: 1. conduct a literature review in the area of feature extraction from dialogue data and spoken language transcriptions; 2. implement a (large) set of lexico-syntactic and semantic features from the papers reviewed in the previous step; 3. build or use an existing classifier using the features extracted in the previous step. The application would be able to differentiate transcribed spoken dialogues from dialogues in other corpora.
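As a flavour of step 2, the sketch below extracts a few simple surface features of the kind that tend to separate transcribed speech from written text. The filler list and the particular feature set are illustrative assumptions, not taken from any specific paper in the (still to be conducted) review:

```python
import re
from collections import Counter

# Illustrative filler/discourse-marker list; a real system would take
# this from the reviewed literature and adapt it to the target language.
FILLERS = {"uh", "um", "eh", "mm", "hmm", "yeah", "okay"}

def extract_features(text):
    """Return a dict of simple lexical surface features for one dialogue."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1
    return {
        "type_token_ratio": len(counts) / n,                 # lexical diversity
        "filler_rate": sum(counts[f] for f in FILLERS) / n,  # speech disfluency
        "first_person_rate": (counts["i"] + counts["we"]) / n,
        "mean_token_length": sum(map(len, tokens)) / n,
    }

feats = extract_features("Um, yeah, I think we should, uh, start now.")
```

Feature vectors like this one would then be fed to the classifier in step 3.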
In the context of the project Pragmatic Resources in Old Indo-European Languages (PROIEL; University of Oslo, 2008-2012), a formalism and guidelines have been developed for the annotation of parts of speech, morphology and dependency syntax in several historical Indo-European languages, such as Ancient Greek, Latin, Old Church Slavonic and Gothic. This formalism and the guidelines derived from it have by now been used for 18 different languages/language stages, including Old Swedish as part of the project Methods for the automatic Analysis of Text in digital Historic Resources, which currently runs at the Department of Swedish at GU.
Syntactic annotation comes in the form of dependency trees, extended with so-called secondary edges to capture argument sharing and with empty tokens to capture ellipsis of verbs and coordinators. As far as we are aware, no one has attempted to statistically parse the PROIEL format. In this project, you will investigate how to parse this format and construct a working setup so that anyone with a PROIEL corpus can come and train a parser on their annotated data.
There is a fair amount of training material available, a selection of which can be used in the MA project. In total, well over a million tokens carry some kind of PROIEL annotation, with the largest languages having over 100k tokens of annotated data.
An important part of the project is to investigate how to tackle the fact that the PROIEL syntactic format is not a classic dependency tree -- are there existing statistical parsers that can handle a format like PROIEL's, can you adjust/extend an existing parser, or could you handle the PROIEL format with a standard parser and some pre- and post-processing? You will also evaluate your final solution's performance on some of the existing annotated material. The historical material also has other challenging features for parsing, such as non-standard orthography, a lack of uniform sentence marking and morphologically rich language, but these issues are not intended to be the focus of the project.
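To make the pre-/post-processing option concrete, here is a minimal sketch: strip the graph-like extras (secondary edges) from a PROIEL-style analysis so that a standard dependency parser sees a plain tree, keeping them aside so they can be restored (or predicted separately) afterwards. The dict-based token format and the example labels are illustrative assumptions, not the actual PROIEL serialization:

```python
def to_plain_tree(sentence):
    """Split a sentence into (plain dependency tree, list of secondary edges)."""
    tree = []
    secondary = []
    for tok in sentence:
        tok = dict(tok)
        for head, rel in tok.pop("secondary", []):
            secondary.append((tok["id"], head, rel))
        tree.append(tok)
    return tree, secondary

def restore(tree, secondary):
    """Re-attach stored secondary edges to a (parsed) plain tree."""
    by_id = {tok["id"]: dict(tok, secondary=[]) for tok in tree}
    for dep, head, rel in secondary:
        by_id[dep]["secondary"].append((head, rel))
    return [by_id[tok["id"]] for tok in tree]

sent = [
    {"id": 1, "form": "veni", "head": 0, "rel": "pred", "secondary": []},
    {"id": 2, "form": "et", "head": 1, "rel": "aux", "secondary": []},
    # illustrative argument sharing: a secondary edge to a second head
    {"id": 3, "form": "vidi", "head": 1, "rel": "pred",
     "secondary": [(1, "xsub")]},
]
tree, sec = to_plain_tree(sent)
```

A more ambitious variant would encode the removed edges in the dependency labels so the parser can learn to predict them, rather than merely storing them.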
This MA project combines theory (forming an understanding of the challenges of the syntactic format for statistical parsing), literature study (surveying the field for existing/adaptable solutions), implementation, and empirical research (evaluation of the final system(s)).
A working solution of decent quality is a publishable result and will be of practical interest to creators of new PROIEL treebanks, as a parser could be used to support future manual annotation efforts.
Programming skills and an NLP background are a prerequisite, as is some knowledge of statistical methods. Affinity with the linguistic side of processing is a plus as this will for instance allow you to do more insightful error analysis.
The project would be supervised by Gerlof Bouma, Yvonne Adesam or possibly others at Språkbanken.
The goal of the Master thesis will be to: i) use/process a large Swedish text collection; ii) experiment with and apply topic modeling, and subsequently sub-corpus topic modeling (following the description by Tangherlini & Leonard, 2013); iii) adapt or create a visual, web-based environment to explore the results (this can be done in various ways, preferably as a) network graphs (Smith et al., 2014); see for instance Figure 1, and integrate them in b) a web-based exploratory environment, such as a dashboard; see Figure 2).
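The sub-corpus step of ii) can be sketched very simply: given per-document topic distributions (hard-coded toy numbers below; in practice they would come from a topic-modeling tool run over the full collection), compare the average topic proportions in a sub-corpus against the whole corpus to find over-represented topics. All numbers here are invented for illustration:

```python
def mean_topic_dist(doc_topics, doc_ids):
    """Average the topic distributions of the given documents."""
    k = len(next(iter(doc_topics.values())))
    totals = [0.0] * k
    for d in doc_ids:
        for i, p in enumerate(doc_topics[d]):
            totals[i] += p
    return [t / len(doc_ids) for t in totals]

doc_topics = {               # doc id -> distribution over 3 topics
    "d1": [0.7, 0.2, 0.1],
    "d2": [0.6, 0.3, 0.1],
    "d3": [0.1, 0.1, 0.8],
    "d4": [0.2, 0.1, 0.7],
}
sub = ["d3", "d4"]           # the sub-corpus of interest

corpus_mean = mean_topic_dist(doc_topics, list(doc_topics))
sub_mean = mean_topic_dist(doc_topics, sub)
# Topics with ratio > 1 are over-represented in the sub-corpus.
ratios = [s / c for s, c in zip(sub_mean, corpus_mean)]
```

The ratios (or a similar contrast) are exactly the kind of quantity the network graphs and dashboard of iii) would visualize.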
For more see the link below:
A poor man's approach to integrating POS-tagging and parsing.
In the by now traditional NLP processing setup, part-of-speech tagging and syntactic parsing are separate, ordered tasks: a sentence is first POS-tagged, after which the results are used as the input for parsing. This model is convenient because it allows one to use a more efficient technique for the "simpler" task of POS-tagging, and it helps keep the search space of the expensive parsing task down.
On the downside, however, a POS-tagger misses out on possibly beneficial syntactic information -- since POS-tagging precedes parsing, syntactic information cannot be used to choose between alternative tag sequences. In turn, we can expect parsing accuracy to suffer from the resulting POS-tagging errors.
Indeed, in PCFG-based parsing, parsing and POS-tagging have long been one and the same processing step. More recently, in the data-driven dependency parsing literature, algorithms for combined parsing and POS-tagging have been proposed and shown to lead to improved results.
In this project, you will investigate a simpler, more general approach to integrating POS-tagging and parsing: letting the POS-tagger and the parser entertain multiple hypotheses about the analysis of a sentence, from which the best analysis can then be chosen. This way one can achieve a freer flow of information between the two processes -- hopefully improving accuracy -- without having to radically change the NLP setup (POS-tagging still precedes parsing). Existing tools can be used with very little alteration, which makes it a poor man's solution.
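One toy realization of this coupling: the tagger passes its k-best tag sequences to the parser instead of a single one, and the final analysis is the sequence maximizing a combined tagger + parser score. The probabilities and the parser-score stub below are invented for illustration; the project would replace them with real tools:

```python
from itertools import product
from math import log

tag_probs = [                       # per-token tag distributions (toy numbers)
    {"PRON": 0.9, "NOUN": 0.1},     # "her"
    {"NOUN": 0.55, "VERB": 0.45},   # "duck"
]

def parser_score(tags):
    # Stand-in for a real parser's log-score of its best parse given
    # this tag sequence; here it simply rewards seeing a verb.
    return 0.5 if "VERB" in tags else 0.0

def best_joint(tag_probs, k=4):
    # Enumerate tag sequences, keep the k best by tagger log-probability...
    seqs = sorted(
        product(*(d.items() for d in tag_probs)),
        key=lambda seq: -sum(log(p) for _, p in seq),
    )[:k]
    # ...then rerank them by tagger score + parser score.
    return max(
        seqs,
        key=lambda seq: sum(log(p) for _, p in seq)
                        + parser_score([t for t, _ in seq]),
    )

best = [t for t, _ in best_joint(tag_probs)]
```

Note how the parser's preference overrides the tagger's locally most probable sequence -- the "free flow of information" the project aims for. Design questions the project would address include how k, the score combination, and the hypothesis representation (n-best lists vs. lattices) affect accuracy and efficiency.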
As part of the project, you will investigate, design and implement different ways of realizing the setup sketched above, and present experiments showing the impact of your choices on analysis accuracy and efficiency.
This MA project combines theoretical aspects (literature study and design of a realization of the system outlined above), implementation, and empirical study (evaluation of the system).
Programming skills and an NLP background are a prerequisite. Knowledge of statistical methods is a big plus, as is affinity with the linguistic side of processing, as this will allow you to do more insightful error analysis. Since the development material will be Swedish text, some passive knowledge of the Swedish language is assumed.
The project would be supervised by Gerlof Bouma, Richard Johansson, Yvonne Adesam or possibly others at Språkbanken.
The goal of this project is the semi-automatic construction of a sentiment lexicon for Swedish. For more information see link below.
The goal of this project is to improve a Swedish dependency parser by integrating a valency lexicon. For more information see link below.
The goal of this project is to implement a part-of-speech tagger and investigate the possibilities of developing a syntactic parser that can handle emergent text, i.e. texts – or representations of texts – that are being produced (and thus frequently changed), in order to identify the syntactic location of, for example, pauses. For more information see link below.
The goal of the Master thesis will be to apply (implement or adapt) techniques, e.g. borrowed from the field of bioinformatics, to identify lexically similar passages (i.e. phrases, sentences, quotes, paraphrases) across collections of Swedish literary texts. Any suitable algorithm can be used for this purpose, though preferably sequence alignment, and the results should be presented/visualized in a user-friendly (and navigable) way.
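As an example of the bioinformatics-style techniques meant here, below is a minimal Smith-Waterman local alignment run over word sequences rather than nucleotides; the match/mismatch/gap scores are illustrative choices:

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between token lists a and b."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0,
                             rows[i - 1][j - 1] + sub,  # (mis)match
                             rows[i - 1][j] + gap,      # gap in b
                             rows[i][j - 1] + gap)      # gap in a
            best = max(best, rows[i][j])
    return best

s1 = "det var en gång en liten flicka".split()
s2 = "en gång en liten pojke".split()
score = local_align(s1, s2)    # rewards the shared run "en gång en liten"
```

A full solution would add traceback to recover the aligned passages themselves, and indexing to avoid comparing all pairs of texts exhaustively.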
Generate a list of collocations, phrasal verbs, set phrases and idioms important for learners of Swedish, linked to proficiency levels, for use in Lärka.
The application Lärka, www.spraakbanken.gu.se/larka, currently under development, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp, one of them focusing on vocabulary. It has been mentioned on several occasions that we should include multi-word expressions in our exercise generator. This also complies with the CEFR “can-do” statements at different levels of proficiency (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). It is, however, a non-trivial task to identify the items that should be included in the curriculum, and even more uncertain how the selected items can be assigned to different proficiency levels.
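One standard starting point for identifying candidate multi-word items is an association measure such as pointwise mutual information over adjacent word pairs, sketched below. The toy corpus is invented, and PMI is only one of several measures (log-likelihood, t-score, ...) the project could compare against Korp statistics:

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, min_count=2):
    """Rank bigrams occurring at least min_count times by PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:
            # PMI: how much more often the pair co-occurs than chance predicts
            scores[(w1, w2)] = log2(c * n / (unigrams[w1] * unigrams[w2]))
    return sorted(scores, key=scores.get, reverse=True)

corpus = ("ta reda på om vi kan ta reda på det "
          "och sedan ta det vidare och sedan gå").split()
ranked = pmi_bigrams(corpus)
```

The harder, open part of the project is the step after ranking: deciding which of the extracted items belong in the curriculum at which CEFR level.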
The aims of this work are the following:
Implement an adaptive diagnostic test for vocabulary and/or grammar for Swedish, based on Second Language Acquisition (SLA) research and frequency statistics available from the COCTAILL corpus.
The application Lärka, www.spraakbanken.gu.se/larka, currently under development, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp. Attempts are being made to align generated exercises with CEFR proficiency scales (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). The actual users, however, may not know their level when they start working with the exercise generator. It is therefore important (and user-friendly) to offer some sort of placement/diagnostic test for those who may need it.
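To illustrate what "adaptive" means here: item difficulty is adjusted after each answer, homing in on the learner's level, in the simplest case by bisection over the CEFR scale. Real adaptive tests would use item-response-theory models and noisy answers; this deterministic sketch is an illustrative simplification:

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def place(answers_correct_at):
    """answers_correct_at: callable level -> bool, i.e. whether the learner
    answers an item of that difficulty correctly. Returns estimated level."""
    lo, hi = 0, len(LEVELS) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if answers_correct_at(LEVELS[mid]):
            lo = mid            # succeeded: level is at least mid
        else:
            hi = mid - 1        # failed: level is below mid
    return LEVELS[lo]

# Simulated learner who handles everything up to and including B2
level = place(lambda lvl: LEVELS.index(lvl) <= 3)
```

The project would replace the simulated learner with items drawn from COCTAILL frequency statistics and a more robust stopping/estimation rule.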
Some examples of existing diagnostic tests for vocabulary are:
The aims of this work are the following: