CLT Seminar: Paul Rayson (Univ. Lancaster) - Historical text mining: to bee or not to be?

In this talk, I will describe our research on spelling variation in Early Modern English (EmodE). Our original motivation was to retrain a semantic field annotation system (USAS, first designed for modern English) for historical corpora. The research has since shifted to the detection of historical spelling variants, because they cause significant problems for corpus-based computational linguistics techniques and tools. I will quantify the extent of the problem in a variety of EmodE corpora and show how it affects even simple procedures such as key word comparisons. I will then describe our solution, a corpus pre-processing tool called VARD (the Variant Detector), which uses techniques adapted from modern spell checkers. VARD offers modern equivalents for historical variants with high levels of precision and recall. The evaluation will focus on how much training data such a system requires.

Date: 2010-03-25 10:15 - 12:00

Location: room L307, Lennart Torstenssonsgatan 8

Page updated: 2010-03-22 14:48
