BMC Medical Education (Nov 2024)

Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study

  • Yikai Chen,
  • Xiujie Huang,
  • Fangjie Yang,
  • Haiming Lin,
  • Haoyu Lin,
  • Zhuoqun Zheng,
  • Qifeng Liang,
  • Jinhai Zhang,
  • Xinxin Li

DOI
https://doi.org/10.1186/s12909-024-06309-x
Journal volume & issue
Vol. 24, no. 1
pp. 1–9

Abstract


Background: This study aimed to evaluate the performance of GPT-3.5, GPT-4, GPT-4o and Google Bard on the United States Medical Licensing Examination (USMLE), the Professional and Linguistic Assessments Board (PLAB) test, the Hong Kong Medical Licensing Examination (HKMLE) and the National Medical Licensing Examination (NMLE).

Methods: This study was conducted in June 2023. Four large language models (LLMs), GPT-3.5, GPT-4, GPT-4o and Google Bard, were applied to four standardized medical licensing examinations (USMLE, PLAB, HKMLE and NMLE). All questions were multiple-choice and were sourced from the question banks of these examinations.

Results: On USMLE Step 1, Step 2 CK and Step 3, the accuracy rates were 91.5%, 94.2% and 92.7% for GPT-4o; 93.2%, 95.0% and 92.0% for GPT-4; 65.6%, 71.6% and 68.5% for GPT-3.5; and 64.3%, 55.6% and 58.1% for Google Bard, respectively. On the PLAB, HKMLE and NMLE, GPT-4o scored 93.3%, 91.7% and 84.9%; GPT-4 scored 86.7%, 89.6% and 69.8%; GPT-3.5 scored 80.0%, 68.1% and 60.4%; and Google Bard scored 54.2%, 71.7% and 61.3%. There was a significant difference in the accuracy rates of the four LLMs across the four medical licensing examinations.

Conclusion: GPT-4o performed better on the medical licensing examinations than the other three LLMs. The performance of the four models on the NMLE needs further improvement.

Clinical trial number: Not applicable.
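The abstract reports a significant difference in accuracy across the four models but does not state which statistical test was used. A minimal sketch of one common approach, a chi-square test of independence on correct/incorrect counts per model, is shown below; the counts are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: testing whether accuracy differs across models using a
# chi-square test of independence. The counts below are hypothetical
# placeholders for illustration only, not data from the study.
from scipy.stats import chi2_contingency

# Rows = models, columns = (correct, incorrect) out of a hypothetical 100 questions
observed = [
    [92, 8],   # hypothetical model A
    [93, 7],   # hypothetical model B
    [66, 34],  # hypothetical model C
    [64, 36],  # hypothetical model D
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```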

Keywords