The model student: GPT-4 performance on graduate biomedical science exams

Daniel Stribling; Yuxing Xia; Maha K. Amer; Kiley S. Graim; Connie J. Mulligan; Rolf Renne

doi:10.1038/s41598-024-55568-7

Scientific Reports (Mar 2024)

Affiliations

Daniel Stribling: Department of Molecular Genetics and Microbiology, University of Florida
Yuxing Xia: Department of Neuroscience, Center for Translational Research in Neurodegenerative Disease, College of Medicine, University of Florida
Maha K. Amer: Department of Molecular Genetics and Microbiology, University of Florida
Kiley S. Graim: Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida
Connie J. Mulligan: UF Genetics Institute, University of Florida
Rolf Renne: Department of Molecular Genetics and Microbiology, University of Florida

Journal volume & issue: Vol. 14, no. 1, pp. 1–11

Abstract

The GPT-4 large language model (LLM) and ChatGPT chatbot have emerged as accessible and capable tools for generating English-language text in a variety of formats. GPT-4 has previously performed well when applied to questions from multiple standardized examinations. However, further evaluation of trustworthiness and accuracy of GPT-4 responses across various knowledge domains is essential before its use as a reference resource. Here, we assess GPT-4 performance on nine graduate-level examinations in the biomedical sciences (seven blinded), finding that GPT-4 scores exceed the student average in seven of nine cases and exceed all student scores for four exams. GPT-4 performed very well on fill-in-the-blank, short-answer, and essay questions, and correctly answered several questions on figures sourced from published manuscripts. Conversely, GPT-4 performed poorly on questions with figures containing simulated data and those requiring a hand-drawn answer. Two GPT-4 answer-sets were flagged as plagiarism based on answer similarity and some model responses included detailed hallucinations. In addition to assessing GPT-4 performance, we discuss patterns and limitations in GPT-4 capabilities with the goal of informing design of future academic examinations in the chatbot era.

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
