Closing the gap between open source and commercial large language models for medical evidence summarization

Gongbo Zhang; Qiao Jin; Yiliang Zhou; Song Wang; Betina Idnay; Yiming Luo; Elizabeth Park; Jordan G. Nestor; Matthew E. Spotnitz; Ali Soroush; Thomas R. Campion; Zhiyong Lu; Chunhua Weng; Yifan Peng

doi:10.1038/s41746-024-01239-w

npj Digital Medicine (Sep 2024)

Affiliations

Gongbo Zhang: Department of Biomedical Informatics, Columbia University
Qiao Jin: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Yiliang Zhou: Department of Population Health Sciences, Weill Cornell Medicine
Song Wang: Cockrell School of Engineering, The University of Texas at Austin
Betina Idnay: Department of Biomedical Informatics, Columbia University
Yiming Luo: Department of Medicine, Columbia University
Elizabeth Park: Department of Medicine, Columbia University
Jordan G. Nestor: Department of Medicine, Columbia University
Matthew E. Spotnitz: Office of the Director, National Institutes of Health
Ali Soroush: Division of Data-Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai
Thomas R. Campion: Department of Population Health Sciences, Weill Cornell Medicine
Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Chunhua Weng: Department of Biomedical Informatics, Columbia University
Yifan Peng: Department of Population Health Sciences, Weill Cornell Medicine

DOI: https://doi.org/10.1038/s41746-024-01239-w
Journal volume & issue: Vol. 7, no. 1, pp. 1–8

Abstract

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to the proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Utilizing a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three broadly used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the performance of all open-source models improved after fine-tuning. The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These improvement trends were observed in both a human evaluation and a larger-scale GPT-4-simulated evaluation.

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
