Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations

Bernadette Quah; Lei Zheng; Timothy Jie Han Sng; Chee Weng Yong; Intekhab Islam

doi:10.1186/s12909-024-05881-6

BMC Medical Education (Sep 2024)

Affiliations

Bernadette Quah: Faculty of Dentistry, National University of Singapore
Lei Zheng: Faculty of Dentistry, National University of Singapore
Timothy Jie Han Sng: Faculty of Dentistry, National University of Singapore
Chee Weng Yong: Faculty of Dentistry, National University of Singapore
Intekhab Islam: Faculty of Dentistry, National University of Singapore

DOI: https://doi.org/10.1186/s12909-024-05881-6
Journal volume & issue: Vol. 24, no. 1, pp. 1–12

Abstract

Background: This study aimed to answer the research question: how reliable is ChatGPT in automated essay scoring (AES) for oral and maxillofacial surgery (OMS) examinations for dental undergraduate students compared to human assessors?

Methods: Sixty-nine undergraduate dental students participated in a closed-book examination comprising two essays at the National University of Singapore. Using pre-created assessment rubrics, three assessors independently performed manual essay scoring, while one separate assessor performed AES using ChatGPT (GPT-4). The intraclass correlation coefficient and Cronbach's α were used to evaluate the reliability and inter-rater agreement of the test scores among all assessors. The mean scores of manual and automated scoring were compared for similarity and correlation.

Results: A strong correlation was observed for Question 1 (r = 0.752–0.848, p [...] 0.05), and there was a strong correlation between AES and manual scores (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), and there was a moderate correlation between AES and manual scores (r = 0.599, p < 0.001).

Conclusion: This study shows the potential of ChatGPT for essay marking. However, an appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT has the potential to aid students in self-assessment or to support large-scale automated marking processes.
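The two simpler statistics named in the Methods, Cronbach's α and the Pearson correlation r, can be sketched as follows. This is a minimal illustration using only the Python standard library. It is not the authors' analysis code: the abstract does not say which software or which intraclass correlation model was used, so the ICC is left out. The function names and the score lists are hypothetical.

```python
from statistics import mean, variance


def cronbach_alpha(scores_by_rater):
    """Cronbach's alpha for k raters who each scored the same n essays.

    scores_by_rater: list of k lists, one list of n scores per rater.
    """
    k = len(scores_by_rater)
    if k < 2:
        raise ValueError("need at least two raters")
    # Sum of each rater's sample variance across essays.
    rater_var_sum = sum(variance(r) for r in scores_by_rater)
    # Sample variance of the per-essay total score summed over raters.
    totals = [sum(essay) for essay in zip(*scores_by_rater)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - rater_var_sum / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical scores for five essays (NOT data from the study).
    manual = [
        [6, 7, 5, 8, 4],  # assessor 1
        [6, 8, 5, 7, 4],  # assessor 2
        [5, 7, 6, 8, 3],  # assessor 3
    ]
    aes = [5, 7, 5, 7, 3]  # automated scores for the same essays
    manual_mean = [mean(essay) for essay in zip(*manual)]
    print("alpha (manual assessors):", cronbach_alpha(manual))
    print("r (AES vs. mean manual):", pearson_r(aes, manual_mean))
```

In this sketch α approaches 1 when the raters rank the essays consistently, and r compares the automated scores with the mean of the manual scores, mirroring the AES-versus-manual comparison described in the Results.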

Published in BMC Medical Education

ISSN: 1472-6920 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Education: Special aspects of education; Medicine
Website: https://bmcmededuc.biomedcentral.com
