Digital Health (Nov 2024)
Exploring the opportunities of large language models for summarizing palliative care consultations: A pilot comparative study
Abstract
Introduction: Recent developments in large language models have demonstrated impressive performance on natural language processing tasks, opening up possibilities for their use in critical domains such as telehealth. We conducted a pilot study of large language models, specifically GPT-3.5, GPT-4, and LLaMA 2, for zero-shot summarization of a doctor–patient conversation during a palliative care teleconsultation.
Methods: We created a bespoke doctor–patient conversation and evaluated the quality of the generated summaries using established automatic metrics, namely BLEU, ROUGE-L, METEOR, and BERTScore, and assessed readability with the Flesch-Kincaid Grade Level, to gauge the efficacy and suitability of these models in the medical domain.
Results: On the automatic metrics, LLaMA 2 7B scored highest on BLEU, indicating strong n-gram precision, while GPT-4 excelled on both ROUGE-L and METEOR, demonstrating its ability to capture longer sequences and preserve semantic accuracy. GPT-4 also led on BERTScore, suggesting better token-level semantic similarity than the other models. For readability, LLaMA 2 7B and LLaMA 2 13B produced summaries with Flesch-Kincaid grade levels of 11.9 and 12.6, respectively, somewhat more complex than the reference value of 10.6. LLaMA 2 70B generated the summaries closest to the reference in simplicity, with a score of 10.7. GPT-3.5's summaries were the most complex, at a grade level of 15.2, while GPT-4's summaries had a grade level of 13.1, making them moderately accessible.
Conclusion: Our findings indicate that all models perform similarly on the palliative care consultation, with GPT-4 slightly better at balancing content comprehension with structural similarity to the source, which makes it a potentially better choice for creating patient-friendly medical summaries. Threats and limitations of such approaches are also discussed in our analysis.
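The abstract does not specify how the evaluation was implemented; the snippet below is a minimal sketch of such a pipeline, assuming the Hugging Face evaluate package for BLEU, ROUGE-L, METEOR, and BERTScore and the textstat package for the Flesch-Kincaid Grade Level. The reference and candidate summaries shown are placeholders, not material from the study.

```python
# Sketch of an automatic-metric and readability evaluation for a generated summary.
# Assumes: pip install evaluate rouge_score nltk bert_score textstat
import evaluate
import textstat

# Placeholder texts; in the study these would be the human reference summary
# and a model-generated summary of the palliative care teleconsultation.
reference = "The patient reports stable pain levels and agrees to continue the current medication plan."
candidate = "Pain is stable and the patient will continue the existing medication regimen."

# n-gram precision against the reference.
bleu = evaluate.load("bleu").compute(predictions=[candidate], references=[[reference]])

# Longest-common-subsequence overlap (ROUGE-L) and unigram matching with synonymy (METEOR).
rouge = evaluate.load("rouge").compute(predictions=[candidate], references=[reference])
meteor = evaluate.load("meteor").compute(predictions=[candidate], references=[reference])

# Token-level semantic similarity using contextual embeddings.
bertscore = evaluate.load("bertscore").compute(
    predictions=[candidate], references=[reference], lang="en"
)

# Readability: U.S. school-grade level needed to understand the summary.
fk_grade = textstat.flesch_kincaid_grade(candidate)

print(f"BLEU:      {bleu['bleu']:.3f}")
print(f"ROUGE-L:   {rouge['rougeL']:.3f}")
print(f"METEOR:    {meteor['meteor']:.3f}")
print(f"BERTScore: {bertscore['f1'][0]:.3f}")
print(f"Flesch-Kincaid grade level: {fk_grade:.1f}")
```

In this sketch, lower Flesch-Kincaid grade levels indicate more accessible summaries, which is how the reported values (e.g., 10.7 for LLaMA 2 70B versus 15.2 for GPT-3.5) would be compared against the reference value of 10.6.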