A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish

Jocelyn Dunstan; Thomas Vakili; Luis Miranda; Fabián Villena; Claudio Aracena; Tamara Quiroga; Paulina Vera; Sebastián Viteri Valenzuela; Victor Rocco

doi:10.1186/s12911-024-02609-w

BMC Medical Informatics and Decision Making (Jul 2024)

Affiliations

Jocelyn Dunstan: Department of Computer Sciences, Pontificia Universidad Católica de Chile
Thomas Vakili: Department of Computer and Systems Sciences, Stockholm University
Luis Miranda: Department of Computer Sciences, Pontificia Universidad Católica de Chile
Fabián Villena: Millennium Institute for Foundational Research on Data
Claudio Aracena: Millennium Institute for Foundational Research on Data
Tamara Quiroga: Department of Computer Sciences, Pontificia Universidad Católica de Chile
Paulina Vera: Servicio de Salud del Maule, Ministerio de Salud
Sebastián Viteri Valenzuela: Asociación Chilena de Seguridad
Victor Rocco: Asociación Chilena de Seguridad

Journal volume & issue: Vol. 24, no. 1, pp. 1–10

Abstract

Despite their high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER), trained on referrals from primary care physicians, to identify diseases, body parts, and medications in this work-related text. We analyzed in detail the differences between the model's output and the gold standard curated by a physician. Moreover, we compared the performance of the NER model on the original narratives, on narratives where personal information has been masked, and on texts where the personal data is replaced by a similar surrogate value (pseudonymization). With this publication, we share the annotation guidelines and the annotated corpus.
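The distinction the abstract draws between masking and pseudonymization can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the label names (`NAME`, `LOCATION`), the surrogate table, the helper functions, and the example sentence are all invented for illustration, and the PII spans are assumed to be already annotated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A PII annotation: character offsets plus a label (hypothetical labels)."""
    start: int
    end: int
    label: str


# Hypothetical surrogate values, one per label; a real system would sample
# varied, realistic replacements instead of a fixed string.
SURROGATES = {"NAME": "María Pérez", "LOCATION": "Talca"}


def find_span(text: str, surface: str, label: str) -> Span:
    """Build a Span from the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return Span(start, start + len(surface), label)


def _rewrite(text: str, spans: list[Span], replacement) -> str:
    # Replace from right to left so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + replacement(span) + text[span.end:]
    return text


def mask(text: str, spans: list[Span]) -> str:
    """Masking: replace each PII span with a placeholder naming its label."""
    return _rewrite(text, spans, lambda s: f"[{s.label}]")


def pseudonymize(text: str, spans: list[Span]) -> str:
    """Pseudonymization: replace each PII span with a surrogate of the same type."""
    return _rewrite(text, spans, lambda s: SURROGATES.get(s.label, f"[{s.label}]"))


# Invented example sentence, not taken from the corpus.
text = "Paciente Juan Soto sufre caída en Santiago."
spans = [
    find_span(text, "Juan Soto", "NAME"),
    find_span(text, "Santiago", "LOCATION"),
]

print(mask(text, spans))
print(pseudonymize(text, spans))
```

Masked text contains placeholder tokens that a downstream NER model never saw during training, whereas pseudonymized text keeps the surface form of a natural narrative. That difference is what makes comparing NER performance across the three text variants informative.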

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
