Journal of Open Humanities Data (Feb 2023)

MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library

  • Sil Hamilton,
  • Andrew Piper

DOI
https://doi.org/10.5334/johd.95
Journal volume & issue
Vol. 9
pp. 3 – 3

Abstract

This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset supplements the May 2022 Hathifile by supplying missing fiction tags, predicted with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. For each work we provide metadata including its genre at the level of fiction or non-fiction, its length in pages, its original language, and its year of publication. With a total of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, a finding with implications for the study of lower-resource languages. We hope the dataset will accelerate empirical research into non-English prose and literature.

Keywords