Heliyon (Jul 2024)

Benchmarking four large language models’ performance of addressing Chinese patients' inquiries about dry eye disease: A two-phase study

  • Runhan Shi,
  • Steven Liu,
  • Xinwei Xu,
  • Zhengqiang Ye,
  • Jin Yang,
  • Qihua Le,
  • Jini Qiu,
  • Lijia Tian,
  • Anji Wei,
  • Kun Shan,
  • Chen Zhao,
  • Xinghuai Sun,
  • Xingtao Zhou,
  • Jiaxu Hong

Journal volume & issue
Vol. 10, no. 14
p. e34391

Abstract


Purpose: To evaluate the performance of four large language models (LLMs)—GPT-4, PaLM 2, Qwen, and Baichuan 2—in generating responses to inquiries from Chinese patients about dry eye disease (DED). Design: Two-phase study, comprising a cross-sectional test in the first phase and a real-world clinical assessment in the second phase. Subjects: Eight board-certified ophthalmologists and 46 patients with DED. Methods: The chatbots' responses to Chinese patients' inquiries about DED were evaluated. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses on a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability was measured using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED posed questions to the two language models (GPT-4 and Baichuan 2) that performed best in the first phase and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed the responses across the same five domains. Main outcome measures: Subjective scores for the five domains and objective readability scores in the first phase; patient satisfaction, readability scores, and subjective five-domain scores in the second phase. Results: In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47; p < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (p < 0.05), compared with 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept at answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, p < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, p < 0.05; ophthalmologist readability: 2.67 vs. 4.33). Conclusions: The findings underscore the potential of LLMs, particularly GPT-4 and Baichuan 2, in delivering accurate and comprehensive responses to questions from Chinese patients about DED.

Keywords