
Fact extraction from (bio)medical article titles


Article titles are brief, descriptive and to the point, using well-chosen, specific terminology intended to attract the reader's attention. Extracting factual information from such titles and building structured fact data banks has great potential to facilitate computational analysis in many areas of biomedicine and in the open domain.


Given a collection of published Swedish article titles, the purpose of this proposal is twofold:

a) to automatically classify titles into factual and non-factual. For this you will need to:

  • write some simple guidelines that will help you differentiate between factual and non-factual examples
  • annotate titles as factual or not
  • decide and extract suitable attributes (features) such as verbs, n-grams etc.
  • experiment with one (or more) machine learning algorithms
  • evaluate and report results
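
The steps above could be prototyped with nothing more than the Python standard library. The following is a minimal sketch, not a recommended final setup: it assumes whitespace tokenisation, word unigrams and bigrams as features, and a small multinomial Naive Bayes classifier with add-one smoothing. The names `features`, `NaiveBayes` and `accuracy` are illustrative, and the four training titles are simply the examples quoted further down in this proposal.

```python
import math
from collections import Counter, defaultdict

def features(title):
    """Lowercased word unigrams plus adjacent-word bigrams."""
    tokens = title.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, titles, labels):
        self.label_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.feat_counts[label].update(features(title))
        # Vocabulary = every feature seen in the training titles.
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for f in features(title):
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(clf, titles, labels):
    """Fraction of titles whose predicted label equals the gold label."""
    hits = sum(clf.predict(t) == y for t, y in zip(titles, labels))
    return hits / len(titles)

# Toy training data: the four example titles from this proposal.
TRAIN_TITLES = [
    "Antioxidanter motverkar fria radikalers nyttiga effekter",
    "B12 och folat skyddar inte mot hjärt-kärlsjukdom",
    "Hyperemesis gravidarum kan vara ärftlig",
    "Influensavaccinering av friska unga vuxna",
]
TRAIN_LABELS = ["factual", "factual", "non-factual", "non-factual"]

clf = NaiveBayes().fit(TRAIN_TITLES, TRAIN_LABELS)
# Accuracy on the training titles only checks that the code runs;
# real evaluation must use held-out test titles.
print(accuracy(clf, TRAIN_TITLES, TRAIN_LABELS))
```

Four titles are far too few to learn anything useful; the point is only to show how annotation, feature extraction, training and evaluation fit together. In practice an established machine learning toolkit would likely be a better choice than a hand-written classifier.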

b) to extract sets of triples from the factual titles and represent them graphically using available software such as "visone" or "touchgraph".
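
One way to hand extracted triples to a graph tool is to write them out as GraphML, an XML graph format that tools such as visone are generally able to import (worth verifying for the version you use). The sketch below assumes triples are plain (subject, predicate, object) string tuples; the function name `triples_to_graphml` and the example triples are illustrative, hand-written readings of the example titles in this proposal, not output of any extraction system.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def triples_to_graphml(triples):
    """Return a GraphML string: one node per subject/object, one edge per triple."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the data attributes used on nodes and edges.
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")

    node_ids = {}  # entity text -> generated node id (ids must not contain spaces)

    def node_id(name):
        if name not in node_ids:
            node_ids[name] = f"n{len(node_ids)}"
            node = ET.SubElement(graph, "node", id=node_ids[name])
            ET.SubElement(node, "data", key="name").text = name
        return node_ids[name]

    for subj, pred, obj in triples:
        edge = ET.SubElement(graph, "edge",
                             source=node_id(subj), target=node_id(obj))
        ET.SubElement(edge, "data", key="label").text = pred
    return ET.tostring(root, encoding="unicode")

# Hand-written example triples (one possible reading of the example titles).
EXAMPLE_TRIPLES = [
    ("Antioxidanter", "motverkar", "fria radikalers nyttiga effekter"),
    ("B12", "skyddar inte mot", "hjärt-kärlsjukdom"),
    ("folat", "skyddar inte mot", "hjärt-kärlsjukdom"),
]

graphml_text = triples_to_graphml(EXAMPLE_TRIPLES)
```

The resulting string can be saved to a `.graphml` file and opened in the graph tool of your choice. Keeping the negation inside the predicate ("skyddar inte mot") is one design option; storing polarity as a separate edge attribute is another.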

A factual title in biomedicine according to Eales et al. (2011) is: "a direct (the title does not merely imply a result but actually states the result) sentential report about the outcome of a biomedical investigation". In this proposal, we take a slightly more general approach, since our data is not strictly biomedical but medical in general. The stated outcome can be either positive or negative. For instance, the first example below is positive and the second negative (the annotations provided below are simplified for readability):

"Antioxidanter motverkar fria radikalers nyttiga effekter" (LT nr 28–29 2009 volym 106, pp 1808)

<substance>Antioxidanter</substance> motverkar <substance>fria radikalers</substance> nyttiga <qualifier value>effekter</qualifier value>

"B12 och folat skyddar inte mot hjärt-kärlsjukdom" (LT nr 38 2010 volym 107, pp, 2228)

<substance>B12</substance> och <substance>folat</substance> skyddar inte mot <disorder>hjärt-kärlsjukdom</disorder>
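
Annotations in the simplified form shown above can be read with a short regular expression. This is a sketch that assumes the simplified notation used in this proposal (non-nested tags whose closing tag repeats the label exactly, including labels with a space such as "qualifier value"); the actual corpus format may differ. The function name `parse_annotated` is illustrative.

```python
import re

# <label>text</label>, where the closing tag repeats the label verbatim.
TAG = re.compile(r"<([^<>/]+)>(.*?)</\1>")

def parse_annotated(title):
    """Return (entities, gaps).

    entities: list of (label, text) pairs in order of appearance.
    gaps: gaps[i] is the unannotated text between entity i and entity i+1,
          which is a natural place to look for a candidate predicate.
    """
    matches = list(TAG.finditer(title))
    entities = [(m.group(1), m.group(2)) for m in matches]
    gaps = [title[a.end():b.start()].strip()
            for a, b in zip(matches, matches[1:])]
    return entities, gaps

example = ("<substance>Antioxidanter</substance> motverkar "
           "<substance>fria radikalers</substance> nyttiga "
           "<qualifier value>effekter</qualifier value>")
entities, gaps = parse_annotated(example)
# entities -> [("substance", "Antioxidanter"),
#              ("substance", "fria radikalers"),
#              ("qualifier value", "effekter")]
# gaps     -> ["motverkar", "nyttiga"]
```

Here the first gap ("motverkar") is the verb linking the first two entities, which suggests a triple; the second gap is a modifier rather than a predicate, so gaps should be treated as candidates only.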

A non-factual title can be one that does not state all the contextual information needed to understand whether the results or implications of the finding have a factual (direct) outcome. A title can also be non-factual because it uses speculative language, such as:

"Hyperemesis gravidarum kan vara ärftlig" (LT nr 22 2010 volym 107, pp 1462)

<disorder>Hyperemesis gravidarum</disorder> kan vara <qualifier value>ärftlig</qualifier value>

"Influensavaccinering av friska unga vuxna" (LT nr 14 2002 volym 107, pp 1600)

<procedure>Influensavaccinering</procedure> av <person>friska unga vuxna</person>
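
Speculation and negation cues like those in the examples above could be turned into simple binary features. The sketch below is only an illustration: "kan" and "inte" come from the example titles, while the remaining cue words are assumptions that should be checked against the annotation guidelines and the data. The name `cue_features` is illustrative.

```python
# "kan" and "inte" appear in the example titles; the other cue words
# are assumed here for illustration and are not taken from the corpus.
SPECULATIVE_CUES = {"kan", "kanske", "möjligen", "troligen", "tycks", "verkar"}
NEGATION_CUES = {"inte", "ej", "ingen", "inga"}

def cue_features(title):
    """Flag whether a title contains any speculative or negation cue word."""
    tokens = set(title.lower().split())
    return {
        "speculative": bool(tokens & SPECULATIVE_CUES),
        "negated": bool(tokens & NEGATION_CUES),
    }

print(cue_features("Hyperemesis gravidarum kan vara ärftlig"))
print(cue_features("B12 och folat skyddar inte mot hjärt-kärlsjukdom"))
print(cue_features("Influensavaccinering av friska unga vuxna"))
```

Word lists are cues, not rules: "kan" also expresses plain ability, and the last title is non-factual without containing any cue at all (it states no outcome), so features of this kind are best combined with others such as verbs or part-of-speech tags.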

For training and evaluation, the article title corpus needs to be suitably divided, e.g. 75%-25%, into training titles and test titles. All titles will be manually annotated as "factual" or "non-factual", but the test portion will be kept only for evaluation and not used during training (including feature generation).
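
A reproducible 75%-25% split could look like the following sketch. The function name `split_corpus`, the fixed seed and the placeholder titles are assumptions made for illustration.

```python
import random

def split_corpus(items, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder identifiers standing in for about 1,000 annotated titles.
corpus = [f"title_{i}" for i in range(1000)]
train, test = split_corpus(corpus)
print(len(train), len(test))
```

Fixing the seed keeps the test portion stable across experiments, so that feature selection and tuning are never influenced by test titles. If the two classes turn out to be unbalanced, splitting each class separately (a stratified split) may be preferable.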


A Swedish collection of published article titles (about 1,000) will be provided in two formats: a raw (unannotated) version and an annotated version with labels from a medical ontology. Note that a few titles are composed of several sentences, and these can be a mix of factual and non-factual statements. A number of other annotations, such as part-of-speech tags, can be provided if necessary.


Native or good Swedish language skills; all data is in Swedish.

Good programming skills; interest in (or experience with) machine learning is a plus!


Dimitrios Kokkinakis

Richard Johansson


  1. Eales J., Demetriou G. and Stevens R. 2011. Creating a focused corpus of factual outcomes from biomedical experiments. Proceedings of the Mining Complex Entities from Network and Biomedical Data. Athens, Greece.
  2. Kastner I. and Monz C. 2009. Automatic Single-Document Key Fact Extraction from Newswire Articles. Proceedings of the 12th Conference of the European Chapter of the ACL (EACL). Athens, Greece.
  3. Kilicoglu H. and Bergler S. 2008. Recognizing speculative language in biomedical research articles: a linguistically motivated perspective. BMC Bioinformatics 2008, 9(Suppl 11):S10.



Page updated: 2014-11-12 15:13
