Zeitschrift für digitale Geisteswissenschaften (Sep 2021)

Publishing an OCR ground truth data set for reuse in an unclear copyright setting. Two case studies with legal and technical solutions to enable a collective OCR ground truth data set effort

  • David Lassner,
  • Julius Coburger,
  • Clemens Neudecker,
  • Anne Baillot

DOI
https://doi.org/10.17175/sb005_006
Journal volume & issue
Vol. 5, no. 6

Abstract

We present an OCR ground truth data set for historical prints and show that training on this data improves recognition results over baselines. We reflect on the reusability of the ground truth data set based on two experiments that examine the legal basis for reusing digitized document images of 19th-century English and German books. We propose a framework for publishing ground truth data even when the digitized document images cannot easily be redistributed.

Keywords