Extracting patient lifestyle characteristics from Dutch clinical text with BERT models

Hielke Muizelaar; Marcel Haas; Koert van Dortmont; Peter van der Putten; Marco Spruit

doi:10.1186/s12911-024-02557-5

BMC Medical Informatics and Decision Making (Jun 2024)

Affiliations

Hielke Muizelaar: LIACS, Leiden University
Marcel Haas: Department of Public Health and Primary Care, Leiden University Medical Center
Koert van Dortmont: Department of Business Intelligence, HagaZiekenhuis
Peter van der Putten: LIACS, Leiden University
Marco Spruit: LIACS, Leiden University

DOI: https://doi.org/10.1186/s12911-024-02557-5
Journal volume & issue: Vol. 24, no. 1, pp. 1–15

Abstract

Background: BERT models have seen widespread use on unstructured text within the clinical domain. However, little to no research has been conducted into classifying unstructured clinical notes on the basis of patient lifestyle indicators, especially in Dutch. This article aims to test the feasibility of deep BERT models on the task of patient lifestyle classification, and to introduce an experimental framework that is easily reproducible in future research.

Methods: This study makes use of unstructured general patient text data from HagaZiekenhuis, a large hospital in the Netherlands. Over 148 000 notes were provided to us, each of which was automatically labelled on the basis of the respective patient's smoking, alcohol usage and drug usage statuses. We test the feasibility of assigning labels automatically and justify it using hand-labelled input. Ultimately, we compare macro F1-scores of string matching, SGD and several BERT models on the task of classifying smoking, alcohol and drug usage. We test Dutch BERT models, as well as English models with translated input.

Results: We find that our further pre-trained MedRoBERTa.nl-HAGA model outperformed every other model on smoking (0.93) and drug usage (0.77). Interestingly, our ClinicalBERT model, which was merely fine-tuned on translated text, performed best on the alcohol task (0.80). In t-SNE visualisations, we show that our MedRoBERTa.nl-HAGA model is the best model at differentiating between classes in the embedding space, explaining its superior classification performance.

Conclusions: We suggest using MedRoBERTa.nl-HAGA as a baseline in future research on Dutch free-text patient lifestyle classification. We furthermore strongly suggest further exploring the application of translation to input text in non-English clinical BERT research, as we translated only a subset of the full set and yet achieved very promising results.
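The models above are compared on macro F1-score, which averages the per-class F1 without weighting by class frequency, so rare classes count as much as common ones. The following is a minimal sketch of that metric only; the function name `macro_f1` and the class labels in the example are illustrative assumptions, not the label set or evaluation code used in the study.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= 1 for every seen label.
    per_class = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels]
    return sum(per_class) / len(per_class)


# Hypothetical labels for a smoking-status task (assumed for illustration).
y_true = ["current", "current", "non", "non", "unknown", "unknown"]
y_pred = ["current", "non", "non", "non", "unknown", "current"]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally, a model that ignores a small class is penalised more heavily than it would be under accuracy or a frequency-weighted F1.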

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
