BMC Medical Informatics and Decision Making (Nov 2024)
Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review
Abstract
Background: Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have drawn growing attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.

Methods: We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans.

Results: We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMA/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were “accuracy”, “completeness”, “appropriateness”, “insight”, and “consistency”.

Conclusions: The most frequently used criteria for defining high-quality LLM outputs have been consistently selected by researchers over the past 1.5 years. However, we identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs could be developed to facilitate research on LLMs in healthcare.
Keywords