Introduction
Article titles are brief, descriptive and to the point, using well-chosen, specific terminology intended to attract the reader's attention. Extracting factual information from such titles and building structured fact data banks have great potential to facilitate computational analysis in many areas of biomedicine and in the open domain.
Purpose
Given a Swedish collection of published article titles, the purpose of this proposal is twofold:
a) to automatically classify titles as factual or non-factual (the training and evaluation setup this requires is described further down);
b) to extract sets of triples from the factual titles and represent them graphically using available software such as "visone" or "touchgraph".
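As a rough illustration of the graphical representation in (b), the sketch below serialises (subject, predicate, object) triples as a directed, edge-labelled graph in GraphML, an XML graph format that many graph tools can import. Whether a given version of visone or touchgraph reads this file as-is is an assumption to check. The function name `triples_to_graphml` and the attribute names `label` and `predicate` are hypothetical, and the triples are written by hand for illustration.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def _tag(name):
    """Qualify an element name with the GraphML namespace."""
    return f"{{{GRAPHML_NS}}}{name}"


def triples_to_graphml(triples):
    """Serialise a list of (subject, predicate, object) triples as GraphML.

    Subjects and objects become nodes; each triple becomes one directed
    edge whose 'predicate' attribute holds the predicate text.
    """
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(_tag("graphml"))
    # Attribute declarations (hypothetical names chosen for this sketch).
    ET.SubElement(root, _tag("key"), {
        "id": "label", "for": "node",
        "attr.name": "label", "attr.type": "string",
    })
    ET.SubElement(root, _tag("key"), {
        "id": "predicate", "for": "edge",
        "attr.name": "predicate", "attr.type": "string",
    })
    graph = ET.SubElement(root, _tag("graph"),
                          {"id": "titles", "edgedefault": "directed"})

    # One node per distinct subject or object string.
    node_ids = {}
    for subj, _pred, obj in triples:
        for name in (subj, obj):
            if name not in node_ids:
                node_ids[name] = f"n{len(node_ids)}"
                node = ET.SubElement(graph, _tag("node"),
                                     {"id": node_ids[name]})
                ET.SubElement(node, _tag("data"), {"key": "label"}).text = name

    # One directed edge per triple, labelled with the predicate.
    for i, (subj, pred, obj) in enumerate(triples):
        edge = ET.SubElement(graph, _tag("edge"), {
            "id": f"e{i}",
            "source": node_ids[subj],
            "target": node_ids[obj],
        })
        ET.SubElement(edge, _tag("data"), {"key": "predicate"}).text = pred

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example = [
        ("B12", "skyddar inte mot", "hjärt-kärlsjukdom"),
        ("folat", "skyddar inte mot", "hjärt-kärlsjukdom"),
    ]
    print(triples_to_graphml(example))
```

Keeping the negation inside the predicate string is one simple way to keep negative outcomes visible in the graph; a separate polarity attribute on the edge would be an alternative design.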
A factual title in biomedicine, according to Eales et al. (2011), is "a direct (the title does not merely imply a result but actually states the result) sentential report about the outcome of a biomedical investigation". In this proposal, we take a slightly more general approach, since our data is not strictly biomedical but medical in general. The reported result can be either a positive or a negative outcome. For instance, the first example below is positive and the second negative (the annotations provided below are simplified for readability):
"Antioxidanter motverkar fria radikalers nyttiga effekter" (LT nr 28–29 2009 volym 106, pp 1808)
<substance>Antioxidanter</substance> motverkar <substance>fria radikalers</substance> nyttiga <qualifier value>effekter</qualifier value>
"B12 och folat skyddar inte mot hjärt-kärlsjukdom" (LT nr 38 2010 volym 107, pp 2228)
<substance>B12</substance> och <substance>folat</substance> skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>
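To make the link between the simplified annotations above and the triples mentioned in the purpose more concrete, here is a deliberately naive sketch. It assumes that the first stretch of plain text between two annotated entities that is not a coordinator is the predicate, that all entities before it are subjects, and that everything after it is the object. The function name `extract_triples`, the regular expressions and the `COORDINATORS` list are all hypothetical; a real system would most likely need part-of-speech or syntactic information.

```python
import re

# Matches one simplified annotation, e.g. <substance>B12</substance>.
# Labels may contain spaces, as in <qualifier value>.
ENTITY_RE = re.compile(r"<(?P<label>[^<>/]+)>(?P<text>.*?)</(?P=label)>")
ANY_TAG_RE = re.compile(r"</?[^<>]+>")

# Assumption: words that merely coordinate entities and are not predicates.
COORDINATORS = {"och", "eller", ","}


def extract_triples(annotated_title):
    """Return a list of (subject, predicate, object) string triples.

    Naive heuristic: the first non-coordinator text between two entities
    is the predicate; entities before it are subjects; the untagged rest
    of the title is the object. Returns [] if no predicate is found.
    """
    entities = list(ENTITY_RE.finditer(annotated_title))
    for i in range(1, len(entities)):
        gap = annotated_title[entities[i - 1].end():entities[i].start()].strip()
        if gap and gap not in COORDINATORS:
            subjects = [m.group("text") for m in entities[:i]]
            rest = annotated_title[entities[i].start():]
            obj = ANY_TAG_RE.sub("", rest).strip()
            return [(subj, gap, obj) for subj in subjects]
    return []


if __name__ == "__main__":
    title = ("<substance>B12</substance> och <substance>folat</substance> "
             "skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>")
    for triple in extract_triples(title):
        print(triple)
```

With this heuristic the negation "inte" stays inside the predicate, so the negative outcome of the second example is not lost. Titles without any text between entities yield no triple at all.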
A non-factual title can be one that does not state all the contextual information necessary to fully understand whether the results or implications of the finding have a factual (direct) outcome. Moreover, a non-factual title can be one that uses speculative language, such as:
"Hyperemesis gravidarum kan vara ärftlig" (LT nr 22 2010 volym 107, pp 1462)
<disorder>Hyperemesis gravidarum</disorder> kan vara <qualifier value>ärftlig</qualifier value>
"Influensavaccinering av friska unga vuxna" (LT nr 14 2002 volym 107, pp 1600)
<procedure>Influensavaccinering</procedure> av <person>friska unga vuxna</person>
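The contrast between the factual and non-factual examples suggests some simple surface cues that could serve as features for a classifier. The sketch below computes a few of them. The cue lists are small hand-picked assumptions, not a validated Swedish lexicon, and the function name `cue_features` is hypothetical. The output is a feature dictionary intended as input to a supervised learner, not a classification decision.

```python
import re

# Assumption: a few Swedish speculation cues ("can", "maybe", "possibly", ...).
HEDGE_CUES = {"kan", "kanske", "möjligen", "troligen", "tycks"}
# Assumption: a few Swedish negation cues ("not", "no", "never", ...).
NEGATION_CUES = {"inte", "ej", "ingen", "inget", "inga", "aldrig"}


def cue_features(title):
    """Return simple surface features for one raw (unannotated) title."""
    # \w is Unicode-aware in Python 3, so å, ä and ö stay inside tokens.
    tokens = re.findall(r"\w+", title.lower())
    return {
        "has_hedge": any(tok in HEDGE_CUES for tok in tokens),
        "has_negation": any(tok in NEGATION_CUES for tok in tokens),
        "is_question": title.rstrip().endswith("?"),
        "n_tokens": len(tokens),
    }


if __name__ == "__main__":
    for example in (
        "Hyperemesis gravidarum kan vara ärftlig",
        "B12 och folat skyddar inte mot hjärt-kärlsjukdom",
        "Influensavaccinering av friska unga vuxna",
    ):
        print(example, cue_features(example))
```

Note that the last example triggers none of these cues: it is non-factual because it has no finite verb stating a result, which is the kind of case where part-of-speech information would probably be needed.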
For training and evaluation, the article title corpus needs to be suitably divided, e.g. 75%-25%, into training sentences and test sentences. All sentences will be manually annotated as "factual" or "non-factual", but the test portion will be kept only for evaluation and not used during training (e.g. for feature generation).
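A division like this could be done as in the following sketch, which splits the annotated titles 75%-25% while keeping the proportion of "factual" and "non-factual" labels roughly equal in both parts. Stratifying by label and fixing the random seed are choices made here for illustration and are not prescribed by the proposal; the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, test_fraction=0.25, seed=0):
    """Split (title, label) pairs into (train, test) lists.

    Each label group is shuffled and split separately, so both parts
    keep roughly the same label distribution as the whole corpus.
    """
    by_label = defaultdict(list)
    for title, label in items:
        by_label[label].append((title, label))

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


if __name__ == "__main__":
    # Placeholder titles; the real corpus has about 1,000 annotated titles.
    corpus = [(f"title {i}", "factual") for i in range(60)]
    corpus += [(f"title {i}", "non-factual") for i in range(60, 100)]
    train, test = stratified_split(corpus)
    print(len(train), len(test))
```

The test list should then be set aside until the final evaluation, so that it cannot influence feature design.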
Material
A Swedish collection of published article titles (about 1,000) will be provided in two formats: a raw (unannotated) version and an annotated version with labels from a medical ontology. Note that a few titles consist of several sentences, which can be a mix of factual and non-factual statements. A number of other annotations, such as part-of-speech tags, can be provided if necessary.
Prerequisites
Native or good Swedish language skills; all data is in Swedish.
Good programming skills; interest in (or experience with) machine learning is a plus!
Supervisors
Dimitrios Kokkinakis
Richard Johansson
References
Page updated: 2014-11-12 15:13