A poor man's approach to integrating POS-tagging and parsing.
In the by now traditional NLP setup, part-of-speech tagging and syntactic parsing are separate, ordered tasks. A sentence is first POS-tagged, after which the results are used as the input for parsing. This model is convenient because it allows one to use a more efficient technique for the "simpler" task of POS-tagging, and because it helps to keep the search space of the expensive parsing task down.
On the downside, however, we note that a POS-tagger is missing out on possibly beneficial syntactic information -- POS-tagging precedes parsing and therefore syntactic information cannot be used to choose between alternative tag sequences. In turn, we can expect parsing to suffer from the resulting lower POS-tagging accuracy.
Indeed, in PCFG-based parsing, parsing and POS-tagging have long been one and the same processing step. More recently, in the data-driven dependency parsing literature, algorithms for combined parsing and POS-tagging have been proposed, and they have been shown to lead to improved results.
In this project, you will investigate a simpler, more general approach to integrating POS-tagging and parsing, by letting the POS-tagger and the parser entertain multiple hypotheses about the analysis of a sentence, from which the best analysis can then be chosen. This way one can achieve a free flow of information between the two processes -- hopefully improving accuracy -- without having to radically change the NLP setup (POS-tagging still precedes parsing). Existing tools can be used with very little alteration, which makes it a poor man's solution.
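To make the idea concrete, here is a minimal sketch of one possible realization: the tagger proposes several scored tag sequences, the parser analyses each of them, and the pair with the best combined score is kept. Everything in it is an assumption made for illustration. The function names, the toy sentence, the made-up probabilities and the simple weighted sum of log-scores are not taken from any existing tool, and how to combine the scores is exactly the kind of design choice the project is meant to investigate.

```python
import math

def select_best_analysis(words, nbest_tags, parse_fn, tag_weight=1.0):
    """Pick the tag sequence and parse with the best combined log-score.

    nbest_tags: list of (tag sequence, tagger log-score) pairs.
    parse_fn:   hypothetical callable returning (heads, parser log-score)
                for a sentence under one given tag sequence.
    tag_weight: assumed interpolation weight for the tagger score.
    """
    best = None
    for tags, tag_logp in nbest_tags:
        heads, parse_logp = parse_fn(words, tags)
        total = tag_weight * tag_logp + parse_logp
        if best is None or total > best[0]:
            best = (total, tags, heads)
    if best is None:
        raise ValueError("need at least one tag hypothesis")
    return best

# Toy stand-ins for a real tagger and parser (all numbers are invented).
WORDS = ("time", "flies", "fast")
NBEST = [
    (("NOUN", "NOUN", "ADV"), math.log(0.6)),  # the tagger's first choice
    (("NOUN", "VERB", "ADV"), math.log(0.4)),  # the tagger's second choice
]

def toy_parse(words, tags):
    # Hypothetical parser: far more confident when the sentence has a verb.
    # heads[i] is the 1-based index of token i's head, 0 marks the root.
    if "VERB" in tags:
        return (2, 0, 2), math.log(0.5)
    return (2, 0, 2), math.log(0.1)

score, tags, heads = select_best_analysis(WORDS, NBEST, toy_parse)
print(tags, heads)
```

In this toy setting the tagger on its own prefers the noun reading of "flies", but the parser's preference outweighs it and the verb reading is selected. This is the kind of information flow from parsing back to tagging that the project aims for, while the tagger still runs before the parser.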
As part of the project, you will investigate, design and implement different ways of realizing the setup sketched above, and present experiments showing the impact of your choices on analysis accuracy and efficiency.
This MA project combines theoretical aspects (literature study and design of a realization of the system outlined above), implementation, and empirical study (evaluation of the system).
Programming skills and an NLP background are prerequisites. Knowledge of statistical methods is a big plus, as is an affinity for the linguistic side of processing, since this will allow you to do more insightful error analysis. Since the development material will be Swedish text, some passive knowledge of the Swedish language is assumed.
The project would be supervised by Gerlof Bouma and Richard Johansson, Yvonne Adesam or possibly others at Språkbanken.