The sixth annual Språkbanken Autumn Workshop will be held on the 17th of October. The workshop theme this year is content (semantics).
The language infrastructure of Språkbanken is freely available to all researchers. Our web-based tools can be used to access all kinds of texts: anything from historical and modern newspaper texts, novels and poetry to social media outlets such as blogs and discussion forums. Use our tools to efficiently wade through billions of sentences and produce mesmerising visualisations. At our annual autumn workshop you can try the tools out! We’ll demo the new features, show you how they’re used, and get a discussion going around your particular research questions.
We will start at 13.15 with presentations featuring our research and research infrastructure and finish with some practical exercises combined with demo and poster presentations. This will be followed by a social gathering with some bubbly and snacks.
A programme is available here: https://spraakbanken.gu.se/swe/Om%20oss/hoestworkshop. Note that the workshop language is Swedish. In order to participate in the practical exercises you must bring a laptop, but this is not a requirement for participation in the workshop.
For planning purposes we kindly ask you to register here: https://spraakbanken.gu.se/swe/Om%2520oss/hoestworkshop/registration no later than 9th October if you are planning to attend.
Date: 2016-10-17 13:15 - 18:00
Location: L100, Lennart Torstenssonsgatan 8
We will show the new version of the Swe-Clarin toolbox at an inauguration ceremony. During the course of this day, researchers from different disciplines in digital humanities will talk about their experiences with using language data as primary research data. There will be stations where our tools are presented and a possibility to try them out with guidance. The evening will end with a mingle and refreshments.
You can read more about the event and indicate your interest in participation here: https://sweclarin.se/eng/Inauguration_of_the_Swe-Clarin_toolbox_webform.
Date: 2016-10-07 10:00 - 20:00
Location: Ågrenska villan
Date: 2016-06-03 10:00 - 12:00
Location: room EE, Campus Johanneberg
Join us for this one-day workshop where researchers in the Gothenburg area (and guests) will share with us how they use machine learning to solve complex research questions in medicine, transport, biology, language technology and urban planning.
Event Website: http://bit.ly/1QXey0u
Please register here: http://doodle.com/poll/pm88pp6yvt469h97
Date: 2016-04-14 09:00 - 16:00
Location: Chalmers Johanneberg Campus, Palmstedt (Student Union Building)
Olof Mogren (Department of Computer Science and Engineering) will defend his licentiate thesis Multi-Document Summarization and Semantic Relatedness.
Automatic summarization is the process of presenting the contents of written documents in a short, comprehensive fashion. Many approaches have been proposed for this problem, some of which extract content from the input documents (extractive methods), and others that generate the language in the summary based on some representation of the document contents (abstractive methods).
This thesis is concerned with extractive summarization in the multi-document setting, and we define the problem as choosing the most informative sentences from the input documents, while minimizing the redundancy in the summary. This definition calls for a way of measuring the similarity between sentences that captures as much as possible of the meaning. We present novel ways of measuring the similarity between sentences, based on neural word embeddings and sentiment analysis. We also show that combining multiple sentence similarity scores, by multiplicative aggregation, helps in the process of creating better extractive summaries.
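As a rough illustration of what multiplicative aggregation of similarity scores can look like, the sketch below multiplies several sentence-similarity matrices element by element. It is not taken from the thesis: the function name `aggregate_similarities` and the example values are hypothetical, and the similarity matrices are assumed to be square lists of scores in the range 0 to 1.

```python
from math import prod


def aggregate_similarities(matrices):
    """Combine several sentence-similarity matrices by elementwise product.

    Each matrix is a square list of lists, where matrices[k][i][j] is the
    similarity between sentence i and sentence j under measure k.
    (Hypothetical helper for illustration only.)
    """
    if not matrices:
        raise ValueError("need at least one similarity matrix")
    n = len(matrices[0])
    for m in matrices:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be square and of equal size")
    # A pair of sentences only scores high if every measure agrees.
    return [
        [prod(m[i][j] for m in matrices) for j in range(n)]
        for i in range(n)
    ]


# Made-up scores for two sentences under two different measures.
embedding_sim = [[1.0, 0.5], [0.5, 1.0]]
sentiment_sim = [[1.0, 0.2], [0.2, 1.0]]
combined = aggregate_similarities([embedding_sim, sentiment_sim])
```

With a product, a low score under any single measure pulls the combined score down, so two sentences count as similar only when all measures agree.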
We also discuss the use of information extraction for improving the quality of automatic summarization by providing ways of assessing the salience of information elements, as well as helping with the fluency of the output and providing the temporal dimension.
Furthermore, we present graph-based algorithms for clustering words by co-occurrence, and for summarizing short online user-reviews by computing bicliques. The biclique algorithm provides a fast, simple algorithm for summarization in many e-commerce settings.
Tapani Raiko from Aalto University.
Thesis fulltext: http://www.cse.chalmers.se/~mogren/lic/mogren2015licentiate.pdf
Date: 2015-11-20 10:00 - 12:00
Location: ML2, Hörsalsvägen 7B, Chalmers
The fifth annual Språkbanken autumn workshop (höstworkshop) is held on Monday the 5th of October, starting at 13.15. The theme this year is historical resources and tools.
Read more about the workshop here: http://spraakbanken.gu.se/eng/Om%20oss/hoestworkshop
Date: 2015-10-05 13:15 - 19:00
Location: T307, Olof Wijksgatan 6
Jessica Villing, Department of Philosophy, Linguistics and Theory of Science, is defending her thesis "Towards Dialogue Strategies for Cognitive Workload Management".
Although it has been shown that drivers are less distracted when using speech interfaces compared to traditional interfaces, using voice control instead of manual controls does not completely solve the problem with distracted drivers. The interaction with the dialogue system may itself add to the driver’s cognitive workload and may therefore be a safety issue. The main purpose of this thesis is to learn more about in-vehicle dialogue during various types of cognitive workload, to use this knowledge to enable safe and non-distracting dialogue system interaction in vehicles. We do this by analysing a corpus of human-human in-vehicle dialogue to learn more about the dialogue strategies used by drivers and passengers during various types of workload. We discuss the types of cognitive workload that we believe are most important to consider when studying the multitasking activity of driving and interacting with a dialogue system, and suggest a method for distinguishing different types of workload by using information about the driver’s workload and driving behaviour. We found that dialogue strategies such as interruptions – in the form of silent pauses and domain switches – are used in response to the driver’s cognitive workload, as well as resumption of unfinished discussions. These behaviours are analysed in order to find strategies for preventing, or shortening the duration time of, high cognitive workload. We also indicate how these strategies can be implemented in in-vehicle dialogue systems.
Opponent: Associate Professor Andrew Kun, University of New Hampshire
Link to the dissertation: https://gupea.ub.gu.se/handle/2077/40178
Date: 2015-10-15 13:15 - 16:00
Location: Lilla Hörsalen, Humanisten, Renströmsgatan 6
In this work, I present a linguistic investigation of the language of Swedish textbooks in the natural sciences, i.e., biology, physics and chemistry. The textbooks, which are used in secondary and upper secondary school, are examined with respect to traditional readability measures, e.g., LIX, OVIX and nominal ratio. I also extract typical linguistic features of the texts, typicality being determined using a proposed quantitative method, labelled the index principle. This empirical, corpus-based method relies on automatic linguistic annotations produced by language technology tools to calculate what I call index lists, rank-ordered lists of characteristic linguistic features of specific text corpora as compared to reference texts.
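For readers unfamiliar with LIX, the sketch below computes it in its commonly cited form: the average number of words per sentence plus the percentage of words longer than six letters. It is not the thesis's own tooling. The function name `lix` is hypothetical, and the tokenisation is a simplifying assumption (sentences are split on `.`, `!` and `?`, and words are runs of letters or digits).

```python
import re


def lix(text: str) -> float:
    """Return the LIX readability score of a text.

    LIX = (words / sentences) + 100 * (long words / words),
    where a long word has more than six characters.
    Tokenisation here is deliberately simple (an assumption).
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Short everyday sentences give a low score ...
easy = lix("Katten sitter på mattan. Den är glad.")
# ... while long technical words push the score up.
hard = lix("Fotosyntesen omvandlar ljusenergi.")
```

Higher values indicate text that is harder to read, which is the sense in which the scores are compared between school levels.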
I produce index lists for typical vocabulary, noun phrase structures and syntactic structures, extracted from a 5.2 million word textbook corpus, compiled as a part of the work presented. As well as being frequent and well dispersed, the linguistic variables selected for the index lists are also characteristic of the text type in question, as is evident when they are compared to a reference corpus, comprising textbooks in the social sciences and mathematics, as well as narrative and academic (university-level) texts.
The results show that textbooks in natural science contain a lot of content-specific, technical vocabulary. This characteristic not only distinguishes natural scientific language from everyday language, but also from social scientific language, which on the lexical level has more in common with narrative texts. On the other hand, the textbook language as a whole is structurally distinguishable from narrative texts, as clearly seen, e.g., in its noun phrase complexity.
In the transition between secondary and upper secondary school, the scores of almost every readability measure go up, indicating an increase in linguistic demands on the readers. In the upper secondary textbooks the words are longer, the vocabulary more varied, the noun phrases longer and more elaborate, and the most typical syntactic structures more complex. Notably, the linguistic development between the form levels is more marked in the natural-science textbooks than in those for social sciences and mathematics. Nevertheless, the textbook language overall shows a relatively low complexity in comparison to academic language.
Mats Wirén, Stockholm University
Date: 2015-12-04 13:15 - 16:00
Location: Lilla Hörsalen, Humanisten, Renströmsgatan 6
Computational analysis of historical and typological data has made great progress in the last fifteen years. In this thesis, I work with vocabulary lists for addressing some classical problems in historical linguistics such as cognate identification, discriminating related languages from unrelated languages, assigning possible dates to splits in a language family, and providing an internal structure to a language family. I compare the internal structure inferred from vocabulary lists with the family trees given in Ethnologue. I explore the ranking of lexical items in the widely used Swadesh word list and compare my ranking to another quantitative reranking method and short word lists composed for discovering long-distance genetic relationships. I show that the choice of string similarity measures is important for internal classification and for discriminating related from unrelated languages. The dating system presented in this thesis can be used for assigning age estimates to any new language group and overcomes the assumption of a constant rate of lexical replacement made by glottochronology. I train and test a linear classifier based on gap-weighted subsequence features for the purpose of cognate identification. An important conclusion from these results is that n-gram approaches can be used for different historical linguistic purposes.
Gerhard Jäger, Professor of General Linguistics, University of Tübingen
Date: 2015-11-13 13:15 - 16:00
Location: Lilla Hörsalen, Humanisten, Renströmsgatan 6
Kristina Lundholm Fors, Department of Philosophy, Linguistics and Theory of Science, is defending her thesis "Production and Perception of Pauses in Speech".
Silences can make or break the conversation: if two persons involved in a conversation have different ideas about the typical length of pauses, they will face problems with turn taking. Pauses occur in conversation for a number of reasons, for example for breathing, thinking, word-searching and turn taking management. In this dissertation, we explore the production and perception of pauses in speech. Our aim consists of three main parts: to describe and analyse the production of pauses, to investigate the perception of pauses, and to examine the role of pauses in turn-taking. Our hypothesis is that pauses fill varying functions, and that these functions depend on the context of the pauses. We believe that the duration of pauses may be linked to the pause type, and that we adapt our pause lengths to the persons we are speaking to. Further, we suggest that pauses occur regularly throughout dialogues. We also hypothesise that the duration of pauses in speech affects the processing of speech.
Pauses are tied to the process of turn taking, and as we learn more about the nature of pauses we may also be able to further develop our understanding of the process of turn holding and turn yielding. We will also be able to use the information about pause production and perception when modelling turn taking in dialogue systems.
Our results show that pause lengths vary greatly across speakers, pause types and dialogues. Pauses tend to be entrained by speakers involved in dialogues, and pauses occur regularly throughout conversations. We also found evidence that pauses have a positive impact on memorising spoken utterances. While speakers adapt their pause lengths to the other speaker in the conversation, they are inclined to keep a consistent ratio between pause types, and this is not dependent on the conversational partner. While it is interesting to look at pauses separately, we need to put them into context to really understand their functions. To highlight the role of pauses in conversation, we proposed an updated turn taking model, where the results from our studies are integrated.
Keywords: pauses, silences, turn taking, dialogue, entrainment
Opponent: Anna Hjalmarsson, KTH
Link to the dissertation: http://hdl.handle.net/2077/39346
Date: 2015-09-11 13:15 - 16:00
Location: Stora Hörsalen, Humanisten, Renströmsgatan 6