Liber Quarterly: The Journal of European Research Libraries (Feb 2020)

Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process

  • Kimmo Kettunen,
  • Mika Koistinen,
  • Jukka Kervinen

DOI
https://doi.org/10.18352/lq.10322
Journal volume & issue
Vol. 30, no. 1

Abstract

Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The collection currently comprises about 16.51 million pages, mainly in Finnish and Swedish. Of these, about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the period from 1771 to 1929; the last nine years, 1921–1929, were opened in January 2018. This paper briefly presents the ground truth Optical Character Recognition (OCR) data of about 500 000 words that has been compiled at the NLF for the development of an improved OCR process for the Finnish collection. We discuss the compilation of the data in general terms and compare the results of the new OCR process with those of the current OCR, using the ground truth data as an evaluation benchmark. We also show, with real newspaper data covering 30 years and 109 million words, that the re-OCRing process improves the quality of the OCRed data.

Keywords