Master thesis project: OCR error correction and segmentation
Goal: Produce better-quality text by finding effective methods for OCR error correction and text segmentation.
Background: Long-term textual archives, spanning decades or centuries, have the potential to answer many interesting research questions. One can look for changes in language and culture, author influence, and writing standards, among others. One difficulty in working with historical documents is the quality of the data. Many long-term archives contain documents that were scanned at a time when OCR technology was far from perfect. In addition, the quality of the documents themselves leaves much to be desired. Because redoing the OCR is a manual procedure that requires a vast amount of time, the OCR is not redone. Instead, the errors should be corrected using rule-based systems and machine learning techniques. Beyond the problems with OCR, many co-occurrence statistics require a definition of a sentence or paragraph, which in itself can be very challenging for texts from certain time periods.
Are you interested in working with a large digitized textual archive and developing techniques for correcting OCR errors in documents over 200 years old? Do you want to find ways to define sentences and paragraphs? Do you want to be part of a research team working on an exciting new research topic?
Problem description: Apply rule-based and statistical machine learning techniques to improve the quality of a large newspaper archive. The improvements will later be used by Språkbanken and made visible to the archive's users.
Recommended skills: Interest in rule-based systems and statistical machine learning techniques; some mathematical background; programming skills; a highly motivated 5th-year student.
Supervisors: If you are interested or have questions, please contact one of the following: Nina Tahmasebi, Phone: 031-786 6953, Email: email@example.com