Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries
Krithi Pushpanathan,
Zhi Wei Lim,
Samantha Min Er Yew,
David Ziyou Chen,
Hazel Anne Hui'En Lin,
Jocelyn Hui Lin Goh,
Wendy Meihua Wong,
Xiaofei Wang,
Marcus Chun Jin Tan,
Victor Teck Chang Koh,
Yih-Chung Tham
Affiliations
Krithi Pushpanathan
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Zhi Wei Lim
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Samantha Min Er Yew
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
David Ziyou Chen
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore, Singapore
Hazel Anne Hui'En Lin
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore, Singapore
Jocelyn Hui Lin Goh
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
Wendy Meihua Wong
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore, Singapore
Xiaofei Wang
Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing, China; Advanced Innovation Centre for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang University, Beijing, China
Marcus Chun Jin Tan
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore, Singapore
Victor Teck Chang Koh
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore, Singapore
Yih-Chung Tham
Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore; Ophthalmology and Visual Sciences Academic Clinical Programme (Eye ACP), Duke NUS Medical School, Singapore, Singapore; Corresponding author
Summary: In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in responding to 37 common inquiries regarding ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. Additionally, we evaluated the self-awareness capabilities (ability to self-check and self-correct) of the LLM-Chatbots. Of the ChatGPT-4.0 responses, 89.2% were rated ‘good’, significantly outperforming ChatGPT-3.5 (59.5%) and Google Bard (40.5%) (all p < 0.001). All three LLM-Chatbots also achieved high mean comprehensiveness scores (ranging from 4.6 to 4.7 out of 5). However, they exhibited subpar to moderate self-awareness capabilities. Our study underscores the potential of ChatGPT-4.0 in delivering accurate and comprehensive responses to ocular symptom inquiries. Rigorous future validation of LLM-Chatbot performance is crucial to ensure reliability and appropriateness for actual clinical use.