Evaluating large language models on medical evidence summarization

Liyan Tang; Zhaoyi Sun; Betina Idnay; Jordan G. Nestor; Ali Soroush; Pierre A. Elias; Ziyang Xu; Ying Ding; Greg Durrett; Justin F. Rousseau; Chunhua Weng; Yifan Peng

doi:10.1038/s41746-023-00896-7

npj Digital Medicine (Aug 2023)

Affiliations

Liyan Tang: School of Information, The University of Texas at Austin
Zhaoyi Sun: Department of Population Health Sciences, Weill Cornell Medicine
Betina Idnay: Department of Biomedical Informatics, Columbia University
Jordan G. Nestor: Department of Medicine, Columbia University
Ali Soroush: Department of Medicine, Columbia University
Pierre A. Elias: Department of Biomedical Informatics, Columbia University
Ziyang Xu: Department of Medicine, Massachusetts General Hospital
Ying Ding: School of Information, The University of Texas at Austin
Greg Durrett: Department of Computer Science, The University of Texas at Austin
Justin F. Rousseau: Departments of Population Health and Neurology, Dell Medical School, The University of Texas at Austin
Chunhua Weng: Department of Biomedical Informatics, Columbia University
Yifan Peng: Department of Population Health Sciences, Weill Cornell Medicine

DOI: https://doi.org/10.1038/s41746-023-00896-7
Journal volume & issue: Vol. 6, no. 1, pp. 1–8

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
