Journal of Data Mining and Digital Humanities (Mar 2024)
Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done
- C. Annemieke Romein,
- Tobias Hodel,
- Femke Gordijn,
- Joris J. van Zundert,
- Alix Chagué,
- Milan van Lange,
- Helle Strandgaard Jensen,
- Andy Stauder,
- Jake Purcell,
- Melissa M. Terras,
- Pauline van den Heuvel,
- Carlijn Keijzer,
- Achim Rabus,
- Chantal Sitaram,
- Aakriti Bhatia,
- Katrien Depuydt,
- Mary Aderonke Afolabi-Adeolu,
- Anastasiia Anikina,
- Elisa Bastianello,
- Lukas Vincent Benzinger,
- Arno Bosse,
- David Brown,
- Ash Charlton,
- André Nilsson Dannevig,
- Klaas van Gelder,
- Sabine C.P.J. Go,
- Marcus J.C. Goh,
- Silvia Gstrein,
- Sewa Hasan,
- Stefan von der Heide,
- Maximilian Hindermann,
- Dorothee Huff,
- Ineke Huysman,
- Ali Idris,
- Liesbeth Keijzer,
- Simon Kemper,
- Sanne Koenders,
- Erika Kuijpers,
- Lisette Rønsig Larsen,
- Sven Lepa,
- Tommy O. Link,
- Annelies van Nispen,
- Joe Nockels,
- Laura M. van Noort,
- Joost Johannes Oosterhuis,
- Vivien Popken,
- María Estrella Puertollano,
- Joosep J. Puusaag,
- Ahmed Sheta,
- Lex Stoop,
- Ebba Strutzenbladh,
- Nicoline van der Sijs,
- Jan Paul van der Spek,
- Barry Benaissa Trouw,
- Geertrui Van Synghel,
- Vladimir Vučković,
- Heleen Wilbrink,
- Sonia Weiss,
- David Joseph Wrisley,
- Riet Zweistra
Affiliations
- C. Annemieke Romein: Huygens Institute for History and Culture of the Netherlands
- Tobias Hodel: University of Bern
- Femke Gordijn: Tilburg University
- Joris J. van Zundert: Royal Netherlands Academy of Arts and Sciences
- Alix Chagué: Université de Montréal
- Milan van Lange: NIOD Institute for War, Holocaust and Genocide Studies
- Helle Strandgaard Jensen: VUC Aarhus
- Andy Stauder: READ-COOP SCE, Austria
- Jake Purcell: American Historical Association
- Melissa M. Terras: University of Edinburgh
- Pauline van den Heuvel: Amsterdam City Archives
- DOI: https://doi.org/10.46298/jdmdh.10403
- Journal volume & issue: Vol. Historical Documents and...
Abstract
This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition (HTR) infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how Ground Truth data can be deposited in a repository and then announced to others through HTR-United. We also suggest appropriate citation methods for automatic text recognition (ATR) data, models, and contributions made by volunteers. In addition, when working with digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. All of these topics relate to the proper acknowledgement of the labour put into digitising, transcribing, and sharing Ground Truth HTR data. They also point to broader issues surrounding the use of machine learning in archival and library contexts, and to how the community should begin to acknowledge and record both contributions and data provenance.
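As a rough illustration of the kind of record the abstract points towards, the sketch below keeps the physical object, the digital collection, the repository location, and the individual contributors as separate fields and derives a citation string from them. The class names, field names, citation layout, and all example values (including the `example.org` URL) are hypothetical and chosen for illustration only; they are not a schema proposed by the paper or used by HTR-United.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    name: str
    role: str  # hypothetical roles, e.g. "transcriber", "reviewer"


@dataclass
class GroundTruthRecord:
    """Hypothetical provenance record for one Ground Truth dataset."""
    title: str
    repository_url: str      # where the Ground Truth is deposited
    physical_source: str     # the original object (holding institution, shelfmark)
    digital_collection: str  # the digital facsimiles that were transcribed
    contributors: list[Contributor] = field(default_factory=list)

    def citation(self) -> str:
        # Name every contributor with their role, then keep the physical
        # object and the digital collection as distinct statements.
        names = "; ".join(f"{c.name} ({c.role})" for c in self.contributors)
        return (
            f"{names}. {self.title}. "
            f"Physical source: {self.physical_source}. "
            f"Digital collection: {self.digital_collection}. "
            f"{self.repository_url}"
        )


# Placeholder values, not real data.
record = GroundTruthRecord(
    title="Example letters, Ground Truth",
    repository_url="https://example.org/gt-dataset",
    physical_source="Example Archive, inv. no. 1",
    digital_collection="Example Archive digital facsimiles",
    contributors=[
        Contributor("A. Volunteer", "transcriber"),
        Contributor("B. Editor", "reviewer"),
    ],
)
print(record.citation())
```

Keeping the physical source and the digital collection in separate fields means neither can silently stand in for the other when the record is cited, and listing contributors with roles lets volunteer work be acknowledged by name.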
Keywords
- automatic text recognition
- handwritten text recognition
- data publication
- open data
- data provenance
- data curation
- ground truth
- sharing