Examining the Efficacy of ChatGPT in Marking Short-Answer Assessments in an Undergraduate Medical Program

Leo Morjaria; Levi Burns; Keyna Bracken; Anthony J. Levinson; Quang N. Ngo; Mark Lee; Matthew Sibbald

doi:10.3390/ime3010004

International Medical Education (Jan 2024)

Affiliations

Leo Morjaria: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada
Levi Burns: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada
Keyna Bracken: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada
Anthony J. Levinson: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada
Quang N. Ngo: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada
Mark Lee: McMaster Education Research, Innovation and Theory (MERIT) Program, McMaster University, Hamilton, ON L8P 1H6, Canada
Matthew Sibbald: Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8P 1H6, Canada

Journal volume & issue: Vol. 3, no. 1, pp. 32–43

Abstract

Traditional approaches to marking short-answer questions face limitations in timeliness, scalability, inter-rater reliability, and faculty time costs. Harnessing generative artificial intelligence (AI) to address some of these shortcomings is attractive. This study aims to validate the use of ChatGPT for evaluating short-answer assessments in an undergraduate medical program. Ten questions from the pre-clerkship medical curriculum were randomly chosen, and for each, six previously marked student answers were collected. These sixty answers were evaluated by ChatGPT in July 2023 under four conditions: with both a rubric and standard, with only a standard, with only a rubric, and with neither. ChatGPT displayed good Spearman correlations with a single human assessor (r = 0.6–0.7, p […] 2 = 0.33). Our findings demonstrate that ChatGPT is a viable, though imperfect, assistant to human assessment, performing comparably to a single expert assessor. This study serves as a foundation for future research on AI-based assessment techniques with potential for further optimization and increased reliability.

Published in International Medical Education

ISSN: 2813-141X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Education: Special aspects of education; Medicine
Website: https://www.mdpi.com/journal/ime
