research_text_lab

Topic models in Swedish literature and other collections

 

Topic modeling is a simple way to analyze large volumes of unlabeled text. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents (Wikipedia <http://en.wikipedia.org/wiki/Topic_model>). Thus, a "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. For a general introduction to topic modeling, see, for example, Steyvers and Griffiths (2007).

Material and applications

The textual material that the topic modeling resources will be applied to consists of i) Swedish literature collections and ii) Swedish biomedical texts. The purpose is, for example, to identify topics that rose or fell in popularity; classify text passages (cf. Jockers, 2011); visualize topics together with authors (cf. Meeks, 2011); and identify potential issues of interest for historians, literary scholars and others (cf. Yang et al., 2011).
 

Available software to be used:

  • MALLET <http://mallet.cs.umass.edu/topics.php>
  • Gensim – Topic Modelling for Humans (Python) <http://radimrehurek.com/gensim/>
  • topicmodels in R <http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf>
  • Comprehensive list of topic modeling software <http://www.cs.princeton.edu/~blei/topicmodeling.html>
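
Whichever toolkit is chosen, the first preprocessing step is the same: turning raw documents into bag-of-words vectors over a shared vocabulary. A minimal, toolkit-independent sketch in Python (illustrative only; the function names are assumptions, and Gensim, MALLET and the R package all provide their own equivalents):

```python
from collections import Counter

def tokenize(text):
    # naive lowercase/whitespace tokenizer; a real pipeline would also
    # strip punctuation and filter stop words
    return text.lower().split()

def build_vocab(docs):
    # assign each word type a stable integer id
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow(doc, vocab):
    # bag-of-words vector: sorted (word_id, count) pairs, the input
    # format that e.g. Gensim's LdaModel expects
    counts = Counter(tokenize(doc))
    return sorted((vocab[t], c) for t, c in counts.items() if t in vocab)
```

Feeding such vectors to any of the listed toolkits then yields topic-word and document-topic distributions.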

Requirements

Good programming skills.
It is not necessary to have Swedish as your mother tongue!
 

Supervisors

Dimitrios Kokkinakis
Richard Johansson
Mats Malm

References

Blei DM. 2012. Probabilistic topic models. Communications of the ACM, vol. 55, no. 4. <http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf>

Jockers M. 2011. Who's Your DH Blog Mate: Match-Making the Day of DH Bloggers with Topic Modeling. Blog post by Matthew L. Jockers, posted 19 March 2010.

Meeks E. 2011. Comprehending the Digital Humanities. Digital Humanities Specialist blog, posted 19 February 2011.

Steyvers M. and Griffiths T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum. <http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf>

Yang T., Torget A. and Mihalcea R. (2011) Topic Modeling on Historical Newspapers. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. The Association for Computational Linguistics, Madison, WI. pages 96–104.

Extensive Topic Modeling bibliography: <http://www.cs.princeton.edu/~mimno/topics.html>

Automatic alignment between expert and non-expert language

Goal

Create automatic alignment between professional medical vocabulary and non-expert vocabulary in Swedish in order to enhance an information retrieval system.

Background

Health care professionals and laypersons express themselves in different ways when discussing medical issues. When searching for documents on a medical topic, they are most likely interested in finding documents at different reading levels and with different vocabulary. It could also be the case that the user expresses the search query in terms typical of one group or the other, while being interested in finding documents from both categories.

Språkbanken has a Swedish medical test collection with documents marked for target group (doctors or patients), which could be used both for categorization of terms and for testing.

Problem description

The task is one of automatic alignment between expert and non-expert terminology. The objective is to enrich an information retrieval system with links between corresponding concepts in the two sublanguages. The alignment can be done with different machine-learning techniques, such as k-nearest-neighbor classifiers or support vector machines.

Automatic alignment of the vocabulary of the two groups could help the user either to find documents written for a certain target group or to find documents for either group even if the query only contains terms from one.
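
As a rough illustration of one possible alignment feature, character n-gram overlap (the Dice coefficient) captures cases where an expert term and a lay term share word material, such as a common stem. This would be only one feature among many for a k-NN or SVM classifier; the function names and the threshold below are illustrative assumptions, not part of any existing system:

```python
def char_ngrams(term, n=3):
    # set of character n-grams of the term, lowercased
    term = term.lower()
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def dice(a, b, n=3):
    # Dice coefficient over character n-gram sets, in [0, 1]
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def best_lay_match(expert_term, lay_terms, threshold=0.3):
    # return the lay term most similar to the expert term,
    # or None if nothing clears the (illustrative) threshold
    score, best = max((dice(expert_term, t), t) for t in lay_terms)
    return best if score >= threshold else None
```

String similarity alone will miss pairs with no shared material (e.g. a Latinate expert term and a native Swedish lay term), which is exactly where distributional features and machine learning come in.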

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.

Supervisors

Karin Friberg Heppin and possibly others from Språkbanken.

References

Diosan, Rogozan and Pécuchet. 2009. Automatic alignment of medical terminologies with general dictionaries for an efficient information retrieval. In: Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration.

Friberg Heppin. 2010. Resolving power of search keys in MedEval – A Swedish medical test collection with user groups: Doctors and Patients.

Using medical domain language models for expert and non-expert language in user-oriented information retrieval

Goal

Create language models for medical experts and for non-experts from Swedish medical documents, and use these to enhance an information retrieval system so that it retrieves documents at a level of expertise suitable for the user.

Background

When searching for documents on a medical topic, health care professionals and laypersons are most likely interested in finding documents at different levels of expertise. Most information retrieval systems do not adjust the returned ranked list of documents to the user's background.

As Språkbanken has a Swedish medical test collection with documents marked for target group (doctors or patients), this collection could be used to build language models for the two user groups, which could then be used to adjust the results to the user's needs.

Problem description

The approach is to build language models for medical expert language and for lay-person language. The objective is to describe differences between the sublanguages and to use these models to retrieve documents suited to the user.
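
One standard way to use a language model in retrieval (cf. Hiemstra, 2000) is query-likelihood scoring with Jelinek-Mercer smoothing: a document is scored by the probability its unigram model assigns to the query, interpolated with a collection-wide model. A minimal sketch, with illustrative function names and interpolation weight:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    # Jelinek-Mercer smoothed query likelihood:
    #   log P(q|d) = sum_t log( lam * P(t|doc) + (1 - lam) * P(t|collection) )
    # smoothing keeps a document that misses a query term from scoring -inf,
    # as long as the term occurs somewhere in the collection
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    score = 0.0
    for t in query:
        p_doc = doc_counts[t] / len(doc)
        p_coll = coll_counts[t] / len(collection)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score
```

Two such models, one estimated from doctor-targeted documents and one from patient-targeted ones, could then be compared to decide which sublanguage a query or document is closer to.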

Recommended skills

General knowledge of Swedish.

Some knowledge of information retrieval.

Some knowledge of machine learning.

Programming skills, for example in Python.

Supervisors

Karin Friberg Heppin and possibly others from Språkbanken.

References

Hiemstra, D. 2000. Using language models for information retrieval. <http://wwwhome.cs.utwente.nl/~hiemstra/papers/thesis.pdf>

Friberg Heppin. 2010. Resolving power of search keys in MedEval – A Swedish medical test collection with user groups: Doctors and Patients. <https://gupea.ub.gu.se/handle/2077/22709>

Diabase

The need for a basic research infrastructure for language technology is increasingly recognized by the language technology research community and research funding agencies alike. At the core of such an infrastructure we find the so-called BLARK -- Basic LAnguage Resource Kit, a set of language resources and language technology tools deemed essential both to fundamental research in language technology and to the development of useful language technology applications for a language. The BLARK, as normally presented in the literature, reflects a modern standard language variety, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. However, modern linguistics increasingly recognizes variation as a fundamental and essential characteristic of human language. We thus argue that a BLARK could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal). In our case, it would be extended along the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.

We are currently extending and merging two lexical resources, SALDO and Dalin. Additionally, we have three major dictionaries of Old Swedish (1225--1526): Söderwall (23,000 entries), Söderwall supplement (21,000 entries), and Schlyter (10,000 entries). Due to overlap, the three resources together contain just under 25,000 different entries/lemmas/headwords. We have started work on a morphological component for Old Swedish, covering the regular paradigms, and have created a smaller lexicon with a couple of thousand entries.

The natural next step after linking up SALDO and Dalin would be to add the Old Swedish lexicon to this growing diachronic Swedish lexical and morphological resource. Including the Old Swedish lexicon in the same way as we are doing this for Dalin's dictionary will probably be more difficult, however, since the distance between Old Swedish and the other two forms of the language is fairly great, something like that between modern English and Anglo-Saxon (Old English). This certainly holds for the grammar -- morphology and syntax -- of the language, and even more so for the semantic information encoded in the SALDO lexical resource. It will be a difficult but hopefully rewarding endeavor to work with the lexical semantics of Old Swedish.

Additionally, we are working on a Swedish FrameNet, building in part on the SALDO work and in part on our long experience in corpus linguistics. In this way, we should be able to forge a bridge from the lexical databases which we have already developed, to syntactic analysis systems. The hypothesis is that substantial parts of the frame semantic specifications in the modern Swedish FrameNet will carry over to the lexical items in Dalin's dictionary, using the (semantic) links independently established between SALDO and Dalin, and possibly further to the Old Swedish lexical resources.

Project contact: Lars Borin (sb-info@svenska.gu.se). For more information, see the project web page: http://spraakbanken.gu.se/eng/forskning/diabase .

Digital areal linguistics

The goal of this project is to create a database of comparable lexical items in a number of representative South Asian languages, with a focus on the Himalayan region in India and to use this database for investigating the Himalayas as a linguistic area.

The project is a collaboration with the IDS project (MPI Leipzig, Germany), an international initiative for collecting comparable basic vocabulary in a large number of languages, where the database can be used for various different purposes. The database will enable the project to investigate questions relating to South Asia and the Himalayan region as linguistic macro- and micro-areas.

The results of this project will contribute to methodological development in digital documentation and linguistic typology research. It will add South Asian languages to the IDS. It will also contribute to our knowledge of the languages of this region, of the Himalayas as a linguistic area, and of areal-typological linguistics in general.

Project contact: Lars Borin (sb-info@svenska.gu.se). For more information, see the project web page: http://spraakbanken.gu.se/eng/research/digital-areal-linguistics .

CONPLISIT

The research program CONPLISIT, Consumption patterns and life-style in Swedish literature -- novels 1830-1860, has two specific goals: first, to deepen our knowledge of the creation of the consumer society, and second, to develop new methods that use literature from the period 1830-1860 as a main source for understanding the context of consumption in this period. This is enabled by a semantically organized lexical resource with a morphological analysis component.

The research is a collaboration between Språkbanken, the Department of Historical Studies at University of Gothenburg, and Litteraturbanken.

Project contact: Lars Borin (sb-info@svenska.gu.se). For more information, see the project web page: http://spraakbanken.gu.se/eng/research/conplisit .

META-NORD

META-NORD is a European project to establish an open linguistic infrastructure in the Baltic and Nordic countries.

Project contact: Martha Dís Brandt (meta-nord@svenska.gu.se). For more information, see the local project web page: http://spraakbanken.gu.se/eng/research/meta-nord , and the European project web page: http://www.meta-net.eu/projects/meta-nord/ .

Name linking and visualization in (Swedish) digital collections

Aim

Given a number of Swedish novels taken from the Swedish Literature Bank (<http://litteraturbanken.se/#!om/inenglish>), pre-annotated with named entities (i.e. person names with their gender [male, female or unknown]), the purpose of this work is to: 

i) find pronominal and other references associated with these person entities and link them to each other, and ii) apply different visualization techniques for analyzing the entities in these novels with respect to the characters involved, e.g. using a network representation so that possible clusters, such as "communities" of people, are easy to identify.

 

Task (i) aims at developing (simple) coreference resolution software for Swedish, whether rule-based, machine-learning-based or hybrid. According to Wikipedia: "co-reference occurs when multiple expressions in a sentence or document refer to the same thing; or in linguistic jargon, they have the same 'referent'. For example, in the sentence 'Mary said she would help me', 'she' and 'Mary' are most likely referring to the same person or group, in which case they are coreferent. Similarly, in 'I saw Scott yesterday. He was fishing by the lake,' Scott and he are most likely coreferent." With respect to (ii), any available visualization software can be used, and a number are available, such as Visone, Touchgraph or Gephi.
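
A rule-based starting point for the coreference task could be a simple recency heuristic: link each pronoun to the most recently mentioned name whose annotated gender matches, exploiting the gender labels already present in the pre-annotated novels. The sketch below is illustrative only; the function name and the tiny pronoun table are assumptions, not part of any existing system:

```python
def resolve_pronouns(tokens, name_gender, pronoun_gender):
    # link each pronoun (by token index) to the most recent preceding
    # name of the same gender; returns {pronoun_index: name}
    links, most_recent = {}, {}
    for i, tok in enumerate(tokens):
        if tok in name_gender:
            most_recent[name_gender[tok]] = tok
        elif tok.lower() in pronoun_gender:
            g = pronoun_gender[tok.lower()]
            if g in most_recent:
                links[i] = most_recent[g]
    return links
```

A real system would of course need to handle plural pronouns, unknown gender, nested quotations and salience beyond simple recency; such cases are where machine-learning or hybrid approaches come in.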

 

As a practical application, the resulting software will be used as a supporting technology for literature scholars who want a bird's-eye view when analyzing literature, for social network analysis, etc.

 

Background

This project deals with "name linking and visualization" in digital collections (e.g. novels). Theoretically, the focus of the project will be framed around the term "distant reading" (Moretti, 2005), or "macro analysis". Distant reading means that "the reality of the text undergoes a process of deliberate reduction and abstraction". According to this view, understanding literature is accomplished not by studying individual texts, but by aggregating and analyzing massive amounts of data. In this way it becomes possible to detect hidden aspects of plots, and the structure and interactions of characters become easier to follow, enabling experimentation and exploration of new uses and developments that would otherwise be impossible. Moretti advocated the use of visual representations such as graphs, maps and trees for literature analysis.

Prerequisites:

Some Swedish language skills - probably no need to be a native speaker.

Very good programming skills.

Supervisors

Dimitrios Kokkinakis, PhD, Department of Swedish

Richard Johansson, PhD, Department of Swedish

Mats Malm, Prof., Department of Language and Literature

Some Relevant Links

Matthew L. Jockers website <http://www.matthewjockers.net/>

Franco Moretti. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.

Daniela Oelke, Dimitrios Kokkinakis, Mats Malm. (2012). Advanced Visual Analytics Methods for Literature Analysis. Proceedings of the Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). An EACL 2012 workshop. Avignon, France. <http://demo.spraakdata.gu.se/svedk/pbl/FINAL_eacl2012-1.pdf>

 

Fact extraction from (bio)medical article titles

Introduction

Article titles are brief, descriptive and to the point, using well-chosen, specific terminology intended to attract the reader's attention. Factual information extraction from such article titles, and the construction of structured fact data banks, has great potential to facilitate computational analysis in many areas of biomedicine and in the open domain.

Purpose

Given a Swedish collection of published article titles the purpose of this proposal is twofold:

a) to automatically classify titles into factual and non-factual. For this you will need to:

  • write some simple guidelines that will help you differentiate between factual/non factual instances/examples
  • annotate titles as factual or not
  • decide and extract suitable attributes (features) such as verbs, n-grams etc.
  • experiment with one (or more) machine learning algorithms
  • evaluate and report results

b) to extract sets of triples from the factual titles and represent them graphically using available software such as Visone or Touchgraph.

A factual title in biomedicine, according to Eales et al. (2011), is "a direct (the title does not merely imply a result but actually states the result) sentential report about the outcome of a biomedical investigation". In this proposal we take a slightly more general approach, since our data is not strictly biomedical but medical in general. Such a result can be either a positive or a negative outcome. For instance, the first example below is positive and the second negative (the annotations provided below are simplified for readability):

"Antioxidanter motverkar fria radikalers nyttiga effekter" (LT nr 28–29 2009 volym 106, pp 1808)

<substance>Antioxidanter</substance> motverkar <substance>fria radikalers</substance> nyttiga <qualifier value>effekter</qualifier value>

"B12 och folat skyddar inte mot hjärt-kärlsjukdom" (LT nr 38 2010 volym 107, pp 2228)

<substance>B12</substance> och <substance>folat</substance> skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>

A non-factual title can be one that does not state all the contextual information needed to fully understand whether the results or implications of the finding have a factual (direct) outcome. Moreover, a non-factual title can be one with speculative language, such as:

"Hyperemesis gravidarum kan vara ärftlig" (LT nr 22 2010 volym 107, pp 1462)

<disorder>Hyperemesis gravidarum</disorder> kan vara <qualifier value>ärftlig</qualifier value>

"Influensavaccinering av friska unga vuxna" (LT nr 14 2002 volym 107, pp 1600)

<procedure>Influensavaccinering</procedure> av <person>friska unga vuxna</person>

For training and evaluation, the article title corpus needs to be suitably divided, e.g. 75%/25%, into training sentences and test sentences. All will be manually annotated as "factual" or "non-factual", but the test portion will be kept only for evaluation and not used during training (e.g. for feature generation).
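
The held-out split and a trivial cue-word baseline (which the machine-learned classifier would have to beat) can be sketched as follows. The speculative cue list here is an illustrative assumption; the real list would come from the annotation guidelines written in the first step:

```python
import random

# illustrative speculative cues, NOT a fixed guideline
SPECULATIVE_CUES = {"kan", "möjligen", "eventuellt", "tycks"}

def split_corpus(titles, train_frac=0.75, seed=0):
    # shuffled 75%/25% split; the test portion is held out entirely
    shuffled = titles[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def baseline_factual(title):
    # trivial baseline: a title containing a speculative cue word
    # is classified as non-factual
    return not (set(title.lower().split()) & SPECULATIVE_CUES)
```

On the examples above, the baseline labels "B12 och folat skyddar inte mot hjärt-kärlsjukdom" factual and "Hyperemesis gravidarum kan vara ärftlig" non-factual; richer features (verbs, n-grams, ontology labels) and a learned model would refine this.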

Material

A Swedish collection of published article titles (about 1,000) will be provided in two formats: a raw (unannotated) format and a version annotated with labels from a medical ontology. Also, a few titles are composed of several sentences, and these can be a mix of factual and non-factual statements. A number of other annotations can be provided if necessary, such as part-of-speech tags.

Prerequisites

Native Swedish or good Swedish language skills - all data is in Swedish.

Good programming skills; an interest in (or experience with) machine learning is a plus!

Supervisors

Dimitrios Kokkinakis

Richard Johansson

References

  1. Eales J., Demetriou G. and Stevens R. 2011. Creating a focused corpus of factual outcomes from biomedical experiments. Proceedings of the Mining Complex Entities from Network and Biomedical Data. Athens, Greece.
  2. Kastner I. and Monz C. 2009. Automatic Single-Document Key Fact Extraction from Newswire Articles. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL). Athens, Greece.
  3. Kilicoglu H. and Bergler S. 2008. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics 2008, 9(Suppl 11):S10.

 

Clustering corpus paragraphs for lexical differentiation

Goal

Developing and evaluating a system for clustering corpus paragraphs in order to differentiate word usages in the corpus.

Background

Determining the range of usages for a particular word in a corpus is a great challenge. Particular aspects of this problem are investigated under the headings word sense disambiguation and word sense induction. In Språkbanken <http://språkbanken.gu.se>, the focus is on developing language-aware tools to aid us in building lexical resources, such as the Swedish FrameNet and Swesaurus (a Swedish wordnet).

Problem description

The paragraph is the smallest content unit of a text. The project aims at classifying/clustering paragraphs in a corpus in a way which makes it likely that the same lemma occurring in paragraphs of the same class (in the same cluster) will reflect the same sense of the word. This will allow us to design a corpus search interface where such hits are collapsed by default and potentially different senses can be highlighted.

The work should preferably be carried out on the Swedish SUC corpus, but some suitable English corpus could be used instead, e.g., in the framework of NLTK. Many relevant tools are available in Java; hence, a familiarity with Java is necessary.
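
Although the actual work would likely use Java tools, the core idea (representing paragraphs as weighted term vectors and grouping similar ones) can be sketched briefly in Python. The tf-idf weighting and cosine similarity are standard; the function names are illustrative:

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    # term frequency times inverse document frequency; words occurring
    # in every paragraph get weight 0 and thus carry no signal
    docs = [Counter(p.lower().split()) for p in paragraphs]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    return [{t: c * math.log(n / df[t]) for t, c in d.items()} for d in docs]

def cosine(u, v):
    # cosine similarity between two sparse vectors (dicts)
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A simple agglomerative or k-means-style procedure over these similarities would then yield the paragraph clusters within which a lemma is assumed to keep one sense.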

Recommended skills

  • Fair linguistic analysis skills in the target language
  • Good programming skills, including familiarity with Java

Supervisor(s)

Lars Borin and possibly others, Språkbanken
