A Language Modelling Approach to Quality Assessment of OCR'ed Historical Text (LREC 2022)

Callum W Booth, Robert Shoemaker, Robert Gaizauskas

Abstract: We hypothesise and evaluate a language model-based approach for scoring the quality of OCR transcriptions in the British Library Newspapers (BLN) corpus parts 1 and 2, to identify the best quality OCR for use in further natural language processing tasks, with a wider view to link individual newspaper reports of crime in nineteenth-century London to the Digital Panopticon—a structured repository of criminal lives. We mitigate the absence of gold standard transcriptions of the BLN corpus by utilising a corpus of genre-adjacent texts that capture the common and legal parlance of nineteenth-century London—the Proceedings of the Old Bailey Online—with a view to ranking the BLN transcriptions by their OCR quality.
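The core idea—score each OCR transcription under a language model trained on genre-adjacent reference text, then rank documents by fluency—can be sketched minimally. The character-bigram model, the toy reference string, and the example documents below are illustrative assumptions for the sketch, not the paper's actual model or data:

```python
import math
from collections import Counter

def train_char_bigram(text):
    """Train an add-one-smoothed character bigram model on reference text."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    vocab = set(text)
    return bigrams, unigrams, vocab

def perplexity(text, model):
    """Per-character perplexity of `text` under the model (lower = more fluent)."""
    bigrams, unigrams, vocab = model
    log_prob = 0.0
    for a, b in zip(text, text[1:]):
        # Add-one smoothing so unseen characters/bigrams get non-zero mass.
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab) + 1)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(text) - 1, 1))

# Tiny stand-in for the genre-adjacent reference corpus (assumption).
reference = "the prisoner was indicted for stealing a watch from the person"
model = train_char_bigram(reference)

# Toy OCR outputs: one clean transcription, one with typical OCR noise.
docs = {
    "clean": "the prisoner was charged with stealing a watch",
    "noisy": "t8e pr1s0ner wqs ch@rged w1th ste@ling a w@tch",
}

# Rank ascending by perplexity: cleaner OCR surfaces first.
ranked = sorted(docs, key=lambda name: perplexity(docs[name], model))
print(ranked)
```

In practice a far larger reference corpus and a stronger language model would be used; the point of the sketch is only the ranking-by-perplexity mechanism.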

Citation

Callum Booth, Robert Shoemaker, and Robert Gaizauskas. 2022. A Language Modelling Approach to Quality Assessment of OCR’ed Historical Text. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5859–5864, Marseille, France. European Language Resources Association.

UKRI Centre for Doctoral Training in Speech and Language Technologies and their Applications - Annual Conference 2021 - Poster Session: Information Extraction and Entity Linkage in Historical Crime Records: OCR quality scoring and post-correction (2021)

Abstract: This research seeks to develop a methodology for parsing crime reports within the OCR texts of nineteenth-century London newspapers in the British Library Newspapers (BLN) corpus. We seek to corroborate the existing information in the Digital Panopticon, a structured repository of criminal histories, by using newspaper reports of police court hearings to shed light on the criminal justice processes that took place before a case was tried in the Old Bailey, giving historians structured access to a valuable additional source of crime data. This work covers the methodology used to identify and rank high-quality OCR documents within the BLN corpus using genre-adjacent language modelling. We examine the landscape of the available data from a time and publication perspective, and introduce rules to reduce the dataset to a working corpus of relevant, high-quality documents. Finally, we explore means of mitigating the transcription noise introduced by physical source degradation, microfilm scan quality, and varying OCR tooling, through heuristic and neural OCR post-correction methods.

Institute of Historical Research Digital History Postgraduate Seminar: Information Extraction and Entity Linkage in Historical Crime Records (2020)

Abstract: This research seeks to develop a methodology for parsing, en masse, a subset of the British Library Newspapers set of digitised newspapers: crime reports in nineteenth-century London. The goal of this research is to corroborate and augment the existing information in the Digital Panopticon, by using newspaper reports of police court hearings to shed light on the criminal justice processes that took place before a case made it to the Old Bailey, giving historians structured access to a valuable additional source of crime data. This presentation covers the project’s progress so far, from initial named entity recognition and entity linkage experiments to the research currently being carried out to help alleviate some of the pitfalls of these processes.

Undergraduate Dissertation: “Triple Scoring: Scoring and ranking the truth of factual triples” (2018)

Abstract: In many cases, what we consider a fact essentially boils down to “x is y”. Facts such as “the sky is blue”, “space is cold”, and “the universe is huge” all follow the same format: subject-relation-object. This is advantageous, as it allows us to represent facts in a mathematical way that can be parsed by a computer. This project aims to create a model that can assign a numeric truth score to a type-like relation triple. These triples take the form of (subject, relation, object) and can be used to represent a fact. In this project, a triple scoring model is designed for the WSDM Cup 2017 Triple Scoring task. The model utilises large corpora, applying natural language processing and information retrieval techniques to corroborate and rank facts, culminating in a model that placed 8th out of 21 submitted solutions.
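The triple representation described above can be illustrated in a few lines. The scoring function here is a deliberately simple hypothetical stand-in (sentence-level co-occurrence counting), not the dissertation's actual model:

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A fact as a (subject, relation, object) tuple, e.g. ("sky", "is", "blue")."""
    subject: str
    relation: str
    object: str

def truth_score(triple, corpus):
    """Hypothetical evidence score: the fraction of corpus sentences
    that mention both the subject and the object of the triple."""
    hits = sum(
        1 for sentence in corpus
        if triple.subject in sentence.lower() and triple.object in sentence.lower()
    )
    return hits / len(corpus)

# Toy corpus standing in for the large corpora used in the project (assumption).
corpus = [
    "The sky is blue on a clear day.",
    "Astronauts report that space is cold.",
    "The sky turned blue after the storm.",
    "The universe is huge beyond comprehension.",
]

fact = Triple("sky", "is", "blue")
print(truth_score(fact, corpus))  # 0.5 — two of the four sentences mention both terms
```

Real triple-scoring systems weigh much richer evidence than raw co-occurrence, but the tuple structure and the corpus-grounded numeric score are the essential ingredients.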