iScience (Nov 2023)

Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries

  • Krithi Pushpanathan,
  • Zhi Wei Lim,
  • Samantha Min Er Yew,
  • David Ziyou Chen,
  • Hazel Anne Hui'En Lin,
  • Jocelyn Hui Lin Goh,
  • Wendy Meihua Wong,
  • Xiaofei Wang,
  • Marcus Chun Jin Tan,
  • Victor Teck Chang Koh,
  • Yih-Chung Tham

Journal volume & issue
Vol. 26, no. 11
p. 108163

Abstract

Summary: In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in answering 37 common inquiries regarding ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. Additionally, we evaluated the self-awareness capabilities (ability to self-check and self-correct) of the LLM-Chatbots. Of the ChatGPT-4.0 responses, 89.2% were rated ‘good’, significantly outperforming ChatGPT-3.5 (59.5%) and Google Bard (40.5%) (all p < 0.001). All three LLM-Chatbots also achieved high mean comprehensiveness scores (ranging from 4.6 to 4.7 out of 5). However, they exhibited subpar to moderate self-awareness capabilities. Our study underscores the potential of ChatGPT-4.0 in delivering accurate and comprehensive responses to ocular symptom inquiries. Future rigorous validation of LLM-Chatbot performance is crucial to ensure reliability and appropriateness for actual clinical use.

Keywords