Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study

Marcos Rojas; Marcelo Rojas; Valentina Burgess; Javier Toro-Pérez; Shima Salehi

doi:10.2196/55048

JMIR Medical Education (Apr 2024)

DOI: https://doi.org/10.2196/55048
Journal volume & issue: Vol. 10, p. e55048

Abstract

Background: The deployment of OpenAI’s ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as “GPT-4 Turbo With Vision”), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile’s medical licensing examinations, a critical step for medical practitioners in Chile, remains less explored. This gap highlights the need to evaluate ChatGPT’s adaptability to diverse linguistic and cultural contexts.

Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile.

Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM’s structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. Each of the 3 ChatGPT versions was given 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate.

Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 […]

Conclusions: This study reveals ChatGPT’s ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not led to significant improvements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals.

Published in JMIR Medical Education

ISSN: 2369-3762 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Education: Special aspects of education; Medicine: Medicine (General)
Website: https://mededu.jmir.org
