research_text_lab

Number Sense Disambiguation for Swedish - "Assign each Number a Sense"

Background

Word sense disambiguation is a well-studied field in the natural language processing community and has produced a full range of successful methods and software. The identification and disambiguation of numerical information in natural language text is far less studied, and to the best of our knowledge there has so far been no research in Sweden providing empirical evidence on the linguistic variation of numerical expressions. This work is therefore a good opportunity to investigate the topic, which matters in many natural language processing tasks that require an understanding of, e.g., quantities (such as information extraction or Q&A).

A numerical expression in a text is a sequence or combination of digits, possibly with operators, identifiers or mathematical symbols. Numerals in text can express a variety of different senses, much as words are used in different senses. For instance, "11" can denote:

  • the age of a person "11 years of age"
  • a reference of time "11 hours"
  • a reference to a published article "see [11]"
  • a quantity "11 women"
  • a part of a phone number "011-726 11 28"
  • a frequency "11 Hz"
  • a latitude "11 degrees"
  • an area "11 km2"
  • a dose "11 mg/ml"
  • ...
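
As a rough illustration of how such senses might be told apart, the Python sketch below assigns a sense label to each number from the words that immediately follow it. The rule set, the sense labels and the function name number_senses are assumptions made for this example and do not come from any existing tool; a real system would need far richer context and probably a trained classifier.

```python
import re

# Hypothetical hand-written rules: (sense label, pattern matched against the
# text immediately following the number). The first matching rule wins.
RULES = [
    ("age", re.compile(r"\s*(years of age|years old|år gammal)\b")),
    ("time", re.compile(r"\s*(hours?|timmar)\b")),
    ("frequency", re.compile(r"\s*Hz\b")),
    ("dose", re.compile(r"\s*mg/ml\b")),
]

# A digit sequence that does not directly follow a letter or another digit,
# so the "2" in "km2" is not treated as a number of its own.
NUMBER = re.compile(r"(?<!\w)\d+")


def number_senses(text):
    """Return a (number, sense) pair for every digit sequence in text."""
    results = []
    for m in NUMBER.finditer(text):
        before, after = text[:m.start()], text[m.end():]
        if before.endswith("[") and after.startswith("]"):
            sense = "reference"
        else:
            # Crude default (an assumption): anything unmatched is a quantity.
            sense = next(
                (label for label, rx in RULES if rx.match(after)), "quantity"
            )
        results.append((m.group(), sense))
    return results


for example in ["11 years of age", "11 hours", "see [11]", "11 Hz", "11 mg/ml"]:
    print(example, number_senses(example))
```

Anything the rules do not cover, including the digit groups of a phone number, falls through to the default label, which shows how quickly hand-written rules run out.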

Purpose

The purpose of this work is to study numerical information processing and to develop new algorithms, or adapt existing ones, for identifying and disambiguating numerical information in Swedish text material. Depending on the background and interests of the student, the work can be given a different focus and scope, e.g., implementing a numerical information processing system from scratch, adapting available software to Swedish, or comparing the effect of different resources and module combinations for numerical processing.

Application

As a practical application, the resulting software will be used as a supporting technology for number sense disambiguation of medical data, perhaps using the LOINC ontology.

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish, and possibly others.

Prerequisites:

Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links and References

NUMEX: SPECIFIC GUIDELINES - Message Understanding Conferences MUC-6 <http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_17.html#HEADING44>

LOINC: Logical Observation Identifiers Names and Codes (LOINC®) Users' Guide. Clem McDonald, Stan Huff, Kathy Mercer, Jo Anna Hernandez, Daniel J. Vreeman

Definition of Sekine’s Extended Named Entity, Version 6.1.0 (English). 2003. <http://qallme.fbk.eu/SekineENE_Definition_v6.pdf>

Stuart Moore, Anna Korhonen and Sabine Buchholz. 2009. Number Sense Disambiguation. In Proceedings of the 12th Conference of the Pacific Association for Computational Linguistics. Sapporo, Japan.

 

Temporal information in Swedish - identification, resolution, normalization and standardization

Background

Identification and resolution of temporal (and numerical) information in natural language text is important in many tasks in artificial intelligence (temporal reasoning) and natural language processing (information extraction and retrieval, Q&A).

A temporal expression in a text is a sequence of tokens (words, numbers and characters) that denotes time, that is, expresses a point in time, a duration or a frequency.

Purpose

The purpose of this work is to study temporal information processing and to develop algorithms for the identification, resolution, normalization and standardization of temporal information in Swedish text material, using TIMEX3/TimeML (or an equivalent).

For instance, the examples below illustrate how the TIMEX3 format is used:

  • "June 7, 2003": <TIMEX3 tid="t1" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3>
  • "the dawn of 2000": <TIMEX3 tid="t2" type="DATE" value="2000" mod="START">the dawn of 2000</TIMEX3>

A more complex example can look like this:

  • "two weeks from June 7, 2003": <TIMEX3 tid="t6" type="DURATION" value="P2W" beginPoint="t61" endPoint="t62">two weeks</TIMEX3> from <TIMEX3 tid="t61" type="DATE" value="2003-06-07">June 7, 2003</TIMEX3><TIMEX3 tid="t62" type="DATE" value="2003-06-21" temporalFunction="true" anchorTimeID="t6"/>
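
To give a feel for what identification and normalization could involve for Swedish, the sketch below tags dates written as day, month name and year (for example "7 juni 2003") and normalizes them to an ISO value inside a TIMEX3 element. The function name tag_dates, the single date pattern and the tid numbering scheme are assumptions for illustration; the sketch handles only this one pattern and does not validate day numbers.

```python
import re

# Swedish month names mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["januari", "februari", "mars", "april", "maj", "juni", "juli",
         "augusti", "september", "oktober", "november", "december"],
        start=1,
    )
}

# Dates written as "7 juni 2003": day, month name, four-digit year.
DATE = re.compile(
    r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b", re.IGNORECASE
)


def tag_dates(text):
    """Wrap each matched date in a TIMEX3 element with a normalized value."""
    counter = 0

    def repl(m):
        nonlocal counter
        counter += 1
        day = int(m.group(1))
        month = MONTHS[m.group(2).lower()]
        value = f"{m.group(3)}-{month:02d}-{day:02d}"
        return (
            f'<TIMEX3 tid="t{counter}" type="DATE" value="{value}">'
            f"{m.group(0)}</TIMEX3>"
        )

    return DATE.sub(repl, text)


print(tag_dates("Patienten opererades den 7 juni 2003."))
```

Durations, relative expressions ("två veckor senare") and anchoring to other time points are the harder part of the task and are not covered by this sketch.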

Depending on the background and interests of the student, the work can be given a different focus and scope, e.g., implementing a temporal information processing system from scratch, adapting available software to Swedish, or comparing the effect of different resources and module combinations for temporal processing.

Application

As a practical application, the resulting software will be used as a supporting technology for de-identifying temporal information in patient data. Normalized and standardized temporal occurrences in authentic text (patient history) will be used to "mask" the temporal information in the text. For instance, a text occurrence of the date "2011-12-15" will be converted to, e.g., "start date + 4 months 7 days" (under the assumption that 'start date' is a relevant point in time from which the patient history started to be recorded). Note: the application will be developed on non-authentic texts, but the intention is to use the developed software on real data.
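
The conversion of a normalized date into an offset from a start date can be sketched as follows. The function names and the start date 2011-08-08 are assumptions chosen so that the example reproduces the offset quoted above; how the start date is actually determined for a patient history is left open here.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift d by whole calendar months, clamping the day to the month length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def mask_date(target, start):
    """Express target as whole months plus remaining days after start."""
    if target < start:
        raise ValueError("target precedes the start date")
    months = (target.year - start.year) * 12 + (target.month - start.month)
    # Step back one month if the month-shifted start overshoots the target.
    if add_months(start, months) > target:
        months -= 1
    days = (target - add_months(start, months)).days
    return f"start date + {months} months {days} days"


start = date.fromisoformat("2011-08-08")  # hypothetical start of the record
print(mask_date(date.fromisoformat("2011-12-15"), start))
# prints: start date + 4 months 7 days
```

Counting whole calendar months first and days second keeps the masked form readable, at the cost that "one month" covers a varying number of days.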

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish

Staffan Svensson, PhD, MD, specialist in clinical pharmacology

Prerequisites:

Native Swedish or good Swedish language skills.

Good programming skills.

Relevant Links

TempEval Temporal Relation Identification <http://timeml.org/tempeval/>

TempEval2: Evaluating Events, Time Expressions, and Temporal Relations <http://www.timeml.org/tempeval2/>

TempEval3: Temporal Annotation <http://www.cs.york.ac.uk/semeval-2013/task1/>

TimeML: Markup Language for Temporal and Event Expressions <http://www.timeml.org/site/index.html>

TIMEX at MUC-6 <http://www.timexportal.info/timexmuc6>

Guidelines for Temporal Expression Annotation for English for TempEval 2010. <http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf>

A multilingual corpus database for typological and genetic linguistics

Goal

Building a multilingual corpus database and interface for typological and genetic linguistics research

Background

Over the last few years, linguists and computational linguists have started looking into the possibilities of using multilingual corpora (mainly parallel corpora) for typological and genetic linguistic research.

Problem description

The aims of this work are (1) to collect and link at the verse level as many digitized Bible texts as possible; (2) to apply linguistic annotation tools for those languages where such tools are available (at least English and Swedish); (3) to correlate linguistic units of varying granularity among the languages using the linguistic annotations and freely available word alignment tools; (4) to design the first version of a user interface for conducting research with the database; (5) to conduct a small typological or genetic linguistic study as a showcase of the utility of the database and user interface.
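
Step (1), linking texts at the verse level, amounts to joining the translations on a shared verse identifier. The sketch below assumes that each text has already been parsed into a mapping from verse IDs to verse text; the ID scheme ("GEN 1:1"), the language codes and the function name link_verses are illustrative assumptions.

```python
def link_verses(translations):
    """Link translations at the verse level.

    translations maps a language code to {verse_id: text}. The result maps
    each verse_id to {language: text}, keeping only verses that are present
    in at least two languages.
    """
    linked = {}
    for language, verses in translations.items():
        for verse_id, text in verses.items():
            linked.setdefault(verse_id, {})[language] = text
    return {v: texts for v, texts in linked.items() if len(texts) >= 2}


translations = {
    "eng": {"GEN 1:1": "In the beginning God created the heaven and the earth."},
    "swe": {"GEN 1:1": "I begynnelsen skapade Gud himmel och jord."},
}
for verse_id, texts in link_verses(translations).items():
    print(verse_id, texts)
```

In practice, versification differs between Bible editions, so a mapping between verse numbering schemes would probably be needed before a join like this is reliable.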

Recommended skills

  • Good knowledge of typological and possibly genetic linguistics
  • Very good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Linking a pronunciation lexicon to SALDO

Goal

Linking the NST Swedish pronunciation lexicon to SALDO

Background

The NST Swedish pronunciation lexicon is a large (almost 1M entries) fullform lexicon for Swedish linking text words to their (standard) pronunciations. SALDO is a large semantic and morphological lexicon for Swedish (see <http://spraakbanken.gu.se/eng/saldo/>).

Problem description

The aim of this work is to link lexical entries in SALDO to the corresponding entries in the NST lexicon, as well as to explore the feasibility of providing the SALDO-FM morphological component with a pronunciation module, i.e., generate pronunciations for word forms not present in the NST lexicon from parts of word forms that are in this lexicon.

Recommended skills

  • Good knowledge of Swedish morphological analysis, and some knowledge of Swedish phonology and phonetics
  • General familiarity with morphological analysis systems
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Swedish multiword-entity extraction

Goal

Finding long (>2 words) word/lemma n-grams

Background

Multiword entities have received much attention lately, both in computational linguistics and in general linguistics. There is a long tradition in computational and corpus linguistics of mining multiword entities from text by applying (a wide range of) collocation measures to pairs of entities (text words, lemmas, syntactic dependencies), contiguous or non-contiguous, in order to find two-word lexical units or terms. Attempts to discover longer units are much rarer in the literature, in part because good collocation measures seem to be lacking for this problem.

Problem description

The aim of this work is to refine a purely frequency-based way of finding contiguous word n-grams in annotated text, for instance by applying methods from work on automatic word segmentation. The preferred target language is Swedish. English is also acceptable, but in this case Språkbanken can provide only limited support with respect to annotation tools and linguistic expertise.
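
A purely frequency-based baseline of the kind to be refined could look like the sketch below, which counts contiguous n-grams of three to five tokens within sentences and keeps those that recur. The function name, the length limits and the frequency threshold are assumptions for illustration; the input is assumed to be already tokenized (or lemmatized) sentences.

```python
from collections import Counter


def frequent_ngrams(sentences, min_n=3, max_n=5, min_count=2):
    """Count contiguous n-grams of min_n..max_n tokens within each sentence.

    Returns (ngram, count) pairs seen at least min_count times, most
    frequent first.
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return [(ng, c) for ng, c in counts.most_common() if c >= min_count]


sentences = [
    ["i", "och", "med", "att", "det", "regnar"],
    ["i", "och", "med", "att", "vi", "kom"],
]
for ngram, count in frequent_ngrams(sentences):
    print(count, " ".join(ngram))
```

The baseline returns the full unit "i och med att" together with its sub-sequences "i och med" and "och med att", all with the same count. Deciding which of such nested candidates is the actual unit is one place where the refinement could start.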

Recommended skills

  • Good knowledge of linguistic analysis
  • General familiarity with POS tagging and parsing
  • Familiarity with machine learning
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Extending IDS

Goal

Enriching IDS using Wiktionary and other multilingual resources

Background

The IDS list is a kind of “universal base vocabulary” containing about 1,500 word senses. See <http://lingweb.eva.mpg.de/ids/>, <http://spraakbanken.gu.se/eng/research/digital-areal-linguistics/word-lists> and <http://spraakbanken.gu.se/swe/sblex/resources#lwt>. There is a general wish on the part of the main editor of the IDS effort to collect IDS lists for as many languages as possible.

Problem description

This project should address the problem of using freely available multilingual resources, such as Wiktionary, in order to add new full or partial IDS lists to the collection. The work should include implementing a way of generating candidate IDS lists from, e.g., Wiktionary, as well as an evaluation of the method by using it to generate lists for languages that are already in the IDS collection.
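
The evaluation step, generating a list for a language that is already in the collection and comparing it with the existing list, can be sketched as a precision/recall computation over concepts. The data layout (concept ID mapped to a set of word forms), the concept labels and the matching criterion are assumptions for illustration, not the IDS data format.

```python
def evaluate_candidates(candidate, gold):
    """Compare a generated word list with an existing one.

    Both arguments map a concept ID to a set of word forms. A concept counts
    as correct when the candidate shares at least one form with the gold
    list. Returns (precision, recall) computed over concepts.
    """
    proposed = [c for c in candidate if candidate[c]]
    correct = [c for c in proposed if candidate[c] & gold.get(c, set())]
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical concept labels with Swedish word forms.
gold = {"DOG": {"hund"}, "WATER": {"vatten"}, "FIRE": {"eld"}}
candidate = {"DOG": {"hund"}, "WATER": {"sjö"}}
print(evaluate_candidates(candidate, gold))
```

Exact string matching is a strict criterion; spelling and transcription differences between sources would probably call for some normalization before comparison.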

Recommended skills

  • Fair knowledge of lexicography
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

A Swedish diachronic lexicon

Goal

Automatic diachronic linking of Swedish lexical resources.

Background

Språkbanken <http://språkbanken.gu.se> possesses a number of digitized lexical resources in various stages of preparation and representing i.a. various historical forms of Swedish. One of the resources – SALDO – is singled out as the pivot resource to which all the others should be linked in some way. The hope is that the resulting interlinking of the lexicons will enable many kinds of linguistic information to be transferred among them. However, the interlinking of the lexical resources has only begun and there is much scope for innovation.

Problem description

This problem is an open one and should be suitably narrowed down to be solvable in the framework of a master’s thesis, e.g., by focusing on one pair of lexical resources but of course with a view to the general applicability of the proposed solution. On the one hand, there are the lexicons themselves with the associated, partly overlapping linguistic information. On the other hand, there are various external resources, such as text corpora representing different historical language stages, and possibly freely available external lexicons. The problem more narrowly construed consists in proposing and implementing a set of tools for interlinking the lexicons, using all and any relevant information available, as well as some kind of evaluation procedure. The interlinking should be semi-automatic, and the extent of the manual component should be explicitly indicated as part of the result (e.g., as the number and percentage of ambiguous links).

Recommended skills

  • Fair knowledge of Swedish lexicography, lexical semantics and grammatical analysis
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

Spelling variation in Swedish text

Goal

Dealing with spelling variation in Swedish text in order to improve lemmatization, part-of-speech tagging and parsing.

Background

Språkbanken <http://språkbanken.gu.se> uses an in-house large lexical resource cum morphological analyzer, plus an off-the-shelf part-of-speech tagger and dependency parser to annotate its online corpora. These tools expect standardized spellings in the texts to be analyzed (although the data-driven tools – the POS tagger and parser – will handle out-of-vocabulary items which are not recognized by the morphological analyzer).

Problem description

Many of the texts in Språkbanken also sport non-standard spellings, either because they represent a pre-standardization language stage (medieval and 17th-century texts) or because they are full of spelling errors and variants, as is often the case with modern blog texts. The problem consists in developing and implementing a (partial) solution for discovering and dealing with the spelling variation in modern texts (for which we already have sufficiently large-scale language analysis tools). Preferably, the solution should be general and extensible to other text types. The work thus includes a good deal of linguistic analysis of lemmatizer, POS tagger and parser output.
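
One simple starting point for dealing with such variation is to map an unknown token to the most similar entry in a lexicon of standard forms. The sketch below uses the similarity matching in Python's standard difflib module; the toy lexicon, the cutoff value and the function name normalize are assumptions for illustration, and this is not how Språkbanken's tools work.

```python
import difflib


def normalize(token, lexicon, cutoff=0.8):
    """Return the closest lexicon entry for token, or token itself if none.

    lexicon is a list of standard word forms; cutoff is the minimum
    similarity ratio (0..1) required for a replacement.
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


lexicon = ["mycket", "också", "egentligen"]  # toy lexicon of standard forms
for token in ["egentligne", "mycket", "xyz"]:
    print(token, "->", normalize(token, lexicon))
```

String similarity alone cannot tell a misspelling from a legitimate out-of-vocabulary word, so context from the POS tagger or parser output would probably be needed to decide when a replacement is warranted.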

Recommended skills

  • Good knowledge of Swedish grammatical analysis
  • General familiarity with POS tagging and parsing
  • Good programming skills

Supervisor(s)

Lars Borin and possibly others, Språkbanken

CLT Cloud (A small CLT Project)

The objective of this combined research and infrastructure project is to equip lexica, semantic databases, morphological processors, parsers, compilers and other software components developed within CLT with so-called web APIs, thus making them available on the internet in the form of web services. This will complement CLT's open source software offerings with "data as a service" and "software as a service", thus enabling users both inside and outside of CLT to develop language-technology-enabled applications in the form of so-called "mashups". In the project, such APIs will be carefully designed, implemented and documented, as well as collected under a common CLT access point to be advertised to the world as "CLT Cloud". Given the broad scope and high quality of software developed within CLT, it is likely that such services, once fully developed, will draw a lot of traffic and thus further improve CLT's reputation. The responsible researchers are Professor Torbjörn Lager and Dr Markus Forsberg.

Swedish FrameNet++

Swedish FrameNet++ (SweFN++) is our working name for a rich Swedish lexicon resource for language technology research and applications. This project started in the fall of 2009 with funding from various sources: The Faculty of Arts, University of Gothenburg provides basic funding through Språkbanken; the Database Infrastructure Committee of the Swedish Research Council partly funds the work within the project Safeguarding the future of Språkbanken (2008-2010); the project is part of the strategic plan for the development of CLT, which consequently also funds the project in part.

The project has two interconnected main goals:

  1. To harmonize a number of existing freely available lexical resources in order to combine the information available in them in an interoperable way;
  2. To add to this amalgamated resource frame information of the same kind as in (English) FrameNet.

In both cases, we will need to do much manual work, but one additional goal of the project is to explore how a workflow can be organized so as to minimize manual labor and maximize the utilization of existing language technology tools as a kind of assistive technology.

By CLT policy, SweFN++ will be made available under an open source/open content license.

For more information, see the project web page: <http://spraakbanken.gu.se/swefn/>.

A presentation of the SweFN++ project was made at the FrameNet Masterclass and Workshop, Milan, Italy, 3rd December, 2009: <http://tlt8.unicatt.it/allegati/Session_I_3.pdf> <http://tlt8.unicatt.it/Slides/Borin_et_alii.pdf>
