Scientific Reports (Nov 2024)

Comparing the performance of ChatGPT-3.5-Turbo, ChatGPT-4, and Google Bard with Iranian students in pre-internship comprehensive exams

  • Soolmaz Zare
  • Soheil Vafaeian
  • Mitra Amini
  • Keyvan Farhadi
  • Mohammadreza Vali
  • Ali Golestani

DOI
https://doi.org/10.1038/s41598-024-79335-w
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 10

Abstract

This study aims to measure the performance of different AI language models on three sets of pre-internship medical exams and to compare their performance with that of Iranian medical students. Three sets of Persian pre-internship exams were used, along with their English translations (six sets in total). In late September 2023, we sent requests to ChatGPT-3.5-Turbo-0613, GPT-4-0613, and Google Bard in both Persian and English (excluding questions with any visual content), with each query in a new session, and reviewed their responses. The GPT models were queried at varying levels of randomness. On both the Persian and English tests, GPT-4 ranked first, obtaining the highest score on all exams and at all randomness levels. While Google Bard scored below average on the Persian exams (though still within an acceptable range), ChatGPT-3.5 failed all of them. There was a significant difference between the large language models (LLMs) on the Persian exams. While GPT-4 yielded the best scores on the English exams, the difference between the LLMs and the students was not statistically significant. GPT-4 outperformed the students and the other LLMs on these medical exams, highlighting its potential application in the medical field. However, more research is needed to fully understand and address the limitations of using these models.
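The querying protocol described above (each question sent in a fresh session, with the GPT models sampled at several randomness levels) corresponds to single-turn calls to OpenAI's chat-completions API, where randomness is controlled by the temperature parameter. The following is a minimal sketch of that setup, not the authors' actual code: the model snapshot names match those given in the abstract, but the temperature values, prompt text, and helper function are illustrative assumptions.

# Hypothetical sketch of the querying protocol in the abstract: each exam
# question is sent as a brand-new, single-turn conversation (no shared chat
# history), and the GPT models are sampled at several randomness settings.
# Temperature values, prompt text, and function names are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo-0613", "gpt-4-0613"]  # snapshots named in the study
TEMPERATURES = [0.0, 0.5, 1.0]                 # illustrative randomness levels

def ask(model: str, question: str, temperature: float) -> str:
    """Send one question in a fresh session and return the model's answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": question}],  # no prior context
    )
    return response.choices[0].message.content

question = "Example multiple-choice exam question goes here."  # placeholder
for model in MODELS:
    for temp in TEMPERATURES:
        answer = ask(model, question, temp)
        print(f"{model} @ T={temp}: {answer[:80]}")

Note that these 0613 snapshots were current at the study's late-September-2023 query date but have since been retired; re-running such a comparison today would require substituting currently served model identifiers.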

Keywords