Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes

Bjørn-Richard Pedersen; Einar Holsbø; Trygve Andersen; Nikita Shvetsov; Johan Ravn; Hilde Leikny Sommerseth; Lars Ailo Bongo

doi:10.51964/hlcs11331

Historical Life Course Studies (Jan 2022)

Affiliations

Bjørn-Richard Pedersen: Norwegian Historical Data Centre, UiT The Arctic University of Norway
Einar Holsbø: Department of Computer Science, UiT The Arctic University of Norway
Trygve Andersen: Norwegian Historical Data Centre, UiT The Arctic University of Norway
Nikita Shvetsov: Department of Computer Science, UiT The Arctic University of Norway
Johan Ravn: Medsensio AS, Tromsø, Norway
Hilde Leikny Sommerseth: Norwegian Historical Data Centre, UiT The Arctic University of Norway
Lars Ailo Bongo: Department of Computer Science, UiT The Arctic University of Norway

DOI: https://doi.org/10.51964/hlcs11331
Journal volume & issue: Vol. 12

Abstract

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at https://github.com/uit-hdl/rhd-codes.
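
The abstract does not say how codes are selected for manual verification or how the two distributions are compared. The following Python sketch is only an illustration of the general idea, not the Occode implementation: it assumes each prediction carries a confidence score, routes predictions below a hypothetical threshold to manual review, and compares code distributions using total variation distance, a measure chosen here for simplicity. All function names, the threshold value, and the example codes are assumptions.

```python
from collections import Counter


def split_by_confidence(predictions, threshold):
    """Split (code, confidence) pairs into accepted and manual-review lists.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    accepted = [(code, conf) for code, conf in predictions if conf >= threshold]
    manual = [(code, conf) for code, conf in predictions if conf < threshold]
    return accepted, manual


def code_distribution(codes):
    """Return the relative frequency of each code (empty dict for no codes)."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up example data: (occupation code, model confidence).
predictions = [("111", 0.99), ("222", 0.40), ("111", 0.95), ("333", 0.90)]
training_codes = ["111", "111", "333"]

accepted, manual = split_by_confidence(predictions, threshold=0.5)
predicted_dist = code_distribution(code for code, _ in accepted)
training_dist = code_distribution(training_codes)
distance = total_variation_distance(predicted_dist, training_dist)
```

A small distance between the two distributions would be consistent with the automatically transcribed codes resembling the training data; what counts as small enough is a judgment the sketch does not settle.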

Published in Historical Life Course Studies

ISSN: 2352-6343 (Online)
Publisher: International Institute of Social History
Country of publisher: Netherlands
LCC subjects: Social Sciences: Economic theory. Demography
Website: http://www.ehps-net.eu/journal
