Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation

Vasileios Ntinopoulos; Hector Rodriguez Cetina Biefer; Igor Tudorache; Nestoras Papadopoulos; Dragan Odavic; Petar Risteski; Achim Haeussler; Omer Dzemali

doi:10.1136/bmjhci-2024-101139

BMJ Health & Care Informatics (Jan 2025)

Affiliations

Vasileios Ntinopoulos: Department of Cardiac Surgery, Municipal Hospital of Zurich – Triemli, Zurich, Switzerland
Hector Rodriguez Cetina Biefer: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Igor Tudorache: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Nestoras Papadopoulos: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Dragan Odavic: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Petar Risteski: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Achim Haeussler: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland
Omer Dzemali: Department of Cardiac Surgery, University Hospital of Zurich, Zurich, Switzerland

DOI: https://doi.org/10.1136/bmjhci-2024-101139
Journal volume & issue: Vol. 32, no. 1

Abstract

Objectives: We aimed to evaluate the performance of multiple large language models (LLMs) in data extraction from unstructured and semi-structured electronic health records.

Methods: Fifty synthetic medical notes in English, containing a structured and an unstructured part, were drafted and evaluated by domain experts and subsequently used for LLM prompting. Eighteen LLMs were evaluated against a baseline transformer-based model. Performance assessment comprised four entity extraction and five binary classification tasks, for a total of 450 predictions per LLM. Consistency of LLM responses was assessed over three same-prompt iterations.

Results: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b exhibited an excellent overall accuracy >0.98 (0.995, 0.988, 0.988, 0.988, 0.986, 0.982, 0.982, and 0.982, respectively), significantly higher than the baseline RoBERTa model (0.742). Claude 2.0, Claude 2.1, Claude 3.0 Opus, PaLM 2 chat-bison, GPT 4, Claude 3.0 Sonnet and Llama 3-70b showed a marginally higher and Gemini Advanced a marginally lower multiple-run consistency than the baseline model RoBERTa (Krippendorff’s alpha value 1, 0.998, 0.996, 0.996, 0.992, 0.991, 0.989, 0.988, and 0.985, respectively).

Discussion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b performed the best, exhibiting outstanding performance in both entity extraction and binary classification, with highly consistent responses over multiple same-prompt iterations. Their use could leverage data for research and unburden healthcare professionals. Real-data analyses are warranted to confirm their performance in a real-world setting.

Conclusion: Claude 3.0 Opus, Claude 3.0 Sonnet, Claude 2.0, GPT 4, Claude 2.1, Gemini Advanced, PaLM 2 chat-bison and Llama 3-70b seem to be able to reliably extract data from unstructured and semi-structured electronic health records. Further analyses using real data are warranted to confirm their performance in a real-world setting.
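
The two metrics named in the abstract can be illustrated with a short sketch. This is not the authors' code, which the abstract does not describe. It assumes exact-match scoring for overall accuracy and the nominal-level form of Krippendorff's alpha with no missing responses. The function names `accuracy` and `krippendorff_alpha_nominal` and the toy data in the usage note are hypothetical. The only figure taken from the source is the prediction count: 50 notes × (4 entity extraction + 5 binary classification tasks) = 450 predictions per model.

```python
from collections import Counter
from itertools import permutations


def accuracy(predictions, gold):
    """Share of predictions that exactly match the reference labels.

    Exact-match scoring is an assumption of this sketch.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists; each inner list holds the responses given
    to one prediction item across repeated runs (three runs in the study).
    """
    # Coincidence matrix: every ordered pair of responses within a unit
    # contributes 1 / (m - 1), where m is the number of responses there.
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a unit with a single response carries no pair information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals n_c for each category c, and the grand total n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one unit with two or more responses")

    # alpha = 1 - D_o / D_e; the common factor 1/n is cancelled from both terms.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one category was ever used; returning 1.0 is a convention
        # chosen for this sketch, since alpha is undefined in that case.
        return 1.0
    return 1 - observed / expected
```

As a usage example, `krippendorff_alpha_nominal([["yes"] * 3, ["no"] * 3])` describes two items answered identically across three runs and yields perfect agreement, while `accuracy(["a", "b"], ["a", "c"])` scores one correct prediction out of two.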

Published in BMJ Health & Care Informatics

ISSN: 2632-1009 (Online)
Publisher: BMJ Publishing Group
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://informatics.bmj.com/
