Performance of large language models on advocating the management of meningitis: a comparative qualitative study

Paulina Kliem; Raoul Sutter; Urs Fisch; Pascale Grzonka

doi:10.1136/bmjhci-2023-100978

BMJ Health & Care Informatics (Feb 2024)

Affiliations

Paulina Kliem: University of Basel, Basel, Switzerland
Raoul Sutter: Intensive Care Unit, Department of Acute Medicine, University Hospital Basel, Basel, Switzerland
Urs Fisch: Department of Clinical Research, University Hospital Basel, Basel, Switzerland
Pascale Grzonka: Clinic for Intensive Care Medicine, University Hospital Basel, Basel, Switzerland

Journal volume & issue: Vol. 31, no. 1

Abstract

Objectives: We aimed to examine the adherence of large language models (LLMs) to bacterial meningitis guidelines using a hypothetical medical case, highlighting their utility and limitations in healthcare.

Methods: A simulated clinical scenario of a patient with bacterial meningitis secondary to mastoiditis was presented in three independent sessions to seven publicly accessible LLMs (Bard, Bing, Claude-2, GPT-3.5, GPT-4, Llama, PaLM). Responses were evaluated for adherence to good clinical practice and two international meningitis guidelines.

Results: A central nervous system infection was identified in 90% of LLM sessions. All sessions recommended imaging, while 81% suggested lumbar puncture. Blood cultures and specific mastoiditis work-up were proposed in only 62% and 38% of sessions, respectively. Only 38% of sessions provided the correct empirical antibiotic treatment, while antiviral treatment and dexamethasone were advised in 33% and 24%, respectively. Misleading statements were generated in 52% of sessions. No significant correlation was found between LLMs' text length and performance (r=0.29, p=0.20). Among all LLMs, GPT-4 demonstrated the best performance.

Discussion: The latest LLMs provide valuable advice on differential diagnosis and diagnostic procedures but vary significantly in treatment-specific information for bacterial meningitis when introduced to a realistic clinical scenario. Misleading statements were common, with performance differences attributed to each LLM's unique algorithm rather than output length.

Conclusions: Users must be aware of such limitations and performance variability when considering LLMs as a support tool for medical decision-making. Further research is needed to refine these models' comprehension of complex medical scenarios and their ability to provide reliable information.
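The abstract reports session-level results only as percentages. The sketch below is a rough consistency check, not part of the study: it assumes the denominator is 21 sessions (seven LLMs, three sessions each, as described in the methods) and that percentages were rounded to the nearest whole number. The labels and the helper `consistent_counts` are hypothetical names introduced here for illustration.

```python
# Assumption: 7 LLMs x 3 independent sessions = 21 sessions in total.
N_LLMS = 7
SESSIONS_PER_LLM = 3
N_SESSIONS = N_LLMS * SESSIONS_PER_LLM

# Percentages as stated in the abstract; the keys are shorthand labels.
REPORTED = {
    "CNS infection identified": 90,
    "imaging recommended": 100,
    "lumbar puncture suggested": 81,
    "blood cultures proposed": 62,
    "mastoiditis work-up proposed": 38,
    "correct empirical antibiotics": 38,
    "antiviral treatment advised": 33,
    "dexamethasone advised": 24,
    "misleading statements": 52,
}


def consistent_counts(pct, n):
    """Return every count k in 0..n whose share of n rounds to pct percent."""
    return [k for k in range(n + 1) if round(100 * k / n) == pct]


if __name__ == "__main__":
    for label, pct in REPORTED.items():
        counts = consistent_counts(pct, N_SESSIONS)
        print(f"{label}: {pct}% -> {counts} of {N_SESSIONS} sessions")
```

Under these assumptions each reported percentage corresponds to exactly one whole number of sessions (for example, 38% to 8 of 21), which supports reading the figures as shares of all 21 sessions rather than of the seven models.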

Published in BMJ Health & Care Informatics

ISSN: 2632-1009 (Online)
Publisher: BMJ Publishing Group
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://informatics.bmj.com/