Analysis of ChatGPT-3.5’s Potential in Generating NBME-Standard Pharmacology Questions: What Can Be Improved?

Marwa Saad; Wesam Almasri; Tanvirul Hye; Monzurul Roni; Changiz Mohiyeddini

doi:10.3390/a17100469

Algorithms (Oct 2024)

Affiliations

Marwa Saad: Department of Foundational Medical Studies, Oakland University William Beaumont School of Medicine, 586 Pioneer Drive, Rochester, MI 48309, USA
Wesam Almasri: Department of Foundational Medical Studies, Oakland University William Beaumont School of Medicine, 586 Pioneer Drive, Rochester, MI 48309, USA
Tanvirul Hye: College of Medicine, Roseman University, Henderson, NV 89014, USA
Monzurul Roni: College of Medicine Peoria, University of Illinois, Peoria, IL 61605, USA
Changiz Mohiyeddini: Department of Foundational Medical Studies, Oakland University William Beaumont School of Medicine, 586 Pioneer Drive, Rochester, MI 48309, USA

DOI: https://doi.org/10.3390/a17100469
Journal volume & issue: Vol. 17, no. 10, p. 469

Abstract

ChatGPT by OpenAI is an AI model designed to generate human-like responses based on diverse datasets. Our study evaluated ChatGPT-3.5’s capability to generate pharmacology multiple-choice questions adhering to the NBME guidelines for USMLE Step exams. The initial findings show ChatGPT’s rapid adoption and potential in healthcare education and practice. However, concerns about its accuracy and depth of understanding prompted this evaluation. Through a structured prompt engineering process, ChatGPT was tasked with generating questions across various organ systems, which were then reviewed by pharmacology experts. ChatGPT consistently met the NBME criteria, achieving an average score of 13.7 out of 16 (85.6%) from expert 1 and 14.5 out of 16 (90.6%) from expert 2, with a combined average of 14.1 out of 16 (88.1%) (Kappa coefficient = 0.76). Despite these high scores, challenges in medical accuracy and depth were noted, with the model often producing “pseudo vignettes” instead of in-depth clinical questions. ChatGPT-3.5 shows potential for generating NBME-style questions, but improvements in medical accuracy and understanding are crucial for its reliable use in medical education. This study underscores the need for AI models tailored to the medical domain to enhance educational tools for medical students.

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
