Computers and Education: Artificial Intelligence (Dec 2024)
Assessing the proficiency of large language models in automatic feedback generation: An evaluation study
Abstract
Assessment feedback is important to student learning. Learning analytics (LA) powered by artificial intelligence shows considerable potential to help instructors with the laborious task of providing feedback. Motivated by recent advances in Generative Pre-trained Transformer (GPT) models, we conducted a study to examine the extent to which GPT models can advance LA-supported feedback systems and improve the efficiency of feedback provision. Specifically, we explored the ability of two GPT models, GPT-3.5 (ChatGPT) and GPT-4, to generate feedback on students' open-ended writing tasks, a common assessment format in higher education, in a data science-related course. We compared the feedback generated by the GPT models with that provided by human instructors in terms of readability, effectiveness (i.e., whether the content contained effective feedback components), and reliability (i.e., whether the feedback correctly assessed student performance). Results showed that (1) both GPT-3.5 and GPT-4 generated feedback that was more readable and more consistent than that of human instructors, (2) GPT-4 outperformed both GPT-3.5 and human instructors in providing feedback that covered effective feedback dimensions, including feeding-up, feeding-forward, the process level, and the self-regulation level, and (3) GPT-4 provided more reliable feedback than GPT-3.5. Based on these findings, we discuss the potential opportunities and challenges of utilising GPT models for assessment feedback generation.