OCR Correction for Corpus-assisted Discourse Studies: A Case Study of Old Newspapers

Dario Del Fante; Giorgio Maria Di Nunzio

doi:10.6092/issn.2532-8816/13689

Umanistica Digitale (Jan 2022)

Affiliations

Dario Del Fante: Istituto di Linguistica Computazionale “A. Zampolli” - Consiglio Nazionale delle Ricerche
Giorgio Maria Di Nunzio: Università di Padova

DOI: https://doi.org/10.6092/issn.2532-8816/13689
Journal volume & issue: no. 11
pp. 99 – 124

Abstract

Optical character recognition (OCR) software, which converts printed characters into digital text, is a fundamental tool in diachronic approaches to Corpus-Assisted Discourse Studies. However, OCR output is not fully accurate, and the resulting error rate can compromise the qualitative analysis in such studies. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction, with the aim of developing a methodology for improving the quality of historical corpora. We applied the methodology to two case studies on newspapers from the beginning of the 20th century, carried out for the linguistic analysis of metaphors representing migration and pandemics. The outcome of the project is a set of rules that are potentially valid in different contexts, applicable to different corpora, and can be reproduced and reused. In terms of computational readability, the proposed procedure aims to make more readable and searchable the vast array of historical text corpora that are currently only partially usable because of the high error rate introduced by OCR software.
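
As an illustration of what a reusable set of correction rules might look like in practice, the following is a minimal sketch in Python. It is not the authors' implementation: the rule list, the example error patterns, and the function name `apply_rules` are all hypothetical, chosen only to show how ordered substitution rules could be applied to OCR output and how often each rule fires.

```python
import re

# Hypothetical correction rules (pattern, replacement), applied in order.
# These patterns are illustrative assumptions, not rules from the paper.
RULES = [
    # Rejoin words that were hyphenated across a line break.
    (re.compile(r"(\w)-\s*\n\s*(\w)"), r"\1\2"),
    # A frequent character confusion: "tbe" read instead of "the".
    (re.compile(r"\btbe\b"), "the"),
    # A spurious space inside a known word.
    (re.compile(r"\bimmi\s?grants\b"), "immigrants"),
    # Digit 1 between lowercase letters is most likely the letter l.
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
]


def apply_rules(text, rules=RULES):
    """Apply each rule in order; return the corrected text and per-rule hit counts."""
    counts = []
    for pattern, replacement in rules:
        text, n = pattern.subn(replacement, text)
        counts.append(n)
    return text, counts


corrected, hits = apply_rules("tbe immi grants arrived in a sma1l vil-\nlage")
print(corrected)
print(hits)
```

Keeping the rules as plain data, separate from the function that applies them, is one way such a rule set could be reused across corpora. The per-rule counts give a simple quantitative check that can sit alongside manual, qualitative inspection of the corrected text.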

Published in Umanistica Digitale

ISSN: 2532-8816 (Online)
Publisher: University of Bologna
Country of publisher: Italy
LCC subjects: General Works: History of scholarship and learning. The humanities
Website: http://umanisticadigitale.unibo.it/
