Computers and Education: Artificial Intelligence (Jun 2024)
Using convolutional neural networks to automatically score eight TIMSS 2019 graphical response items
Abstract
International large-scale assessments (ILSAs) have used graphical response items to measure student ability for decades, but they have yet to implement automated scoring of these responses and instead rely on human scoring alone. To investigate how scores produced by machine algorithms compare with those provided by human raters, we applied convolutional neural networks (CNNs) to classify image-based responses to eight TIMSS 2019 items. Our results show that the most accurate CNN models classified over 99% of the image responses into the appropriate scoring category for dichotomous items and almost 98% for one trichotomous item. Additionally, during the modeling process, the CNNs correctly classified numerous image responses that human raters had scored incorrectly. For most items, the number of incorrectly human-scored responses exceeded the average number of responses misclassified by the most accurate models. These results suggest that automated scoring using CNNs is comparable to, and in many cases more accurate than, human rating, even across a wide variety of graphing tasks. This paper argues that the machine learning procedure explored here could be implemented in ILSAs as a verification method to improve the accuracy and consistency of graphical response item scores. In lieu of additional human raters, ILSAs could implement CNN-based automated scoring to provide a second set of scores, thus reducing the workload and costs associated with human scoring.
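To make the scoring pipeline concrete, the sketch below shows the shape of the computation a CNN scorer performs: an image response passes through convolutional filters, a nonlinearity, pooling, and a linear head that yields class probabilities over scoring categories (e.g., incorrect vs. correct for a dichotomous item). This is an illustrative NumPy forward pass with random, untrained weights; the architecture, filter sizes, and image dimensions here are assumptions for demonstration only and do not reflect the actual models used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernels):
    """Valid 2-D convolution of a single-channel image with a bank of kernels."""
    kh, kw = kernels.shape[1:]
    h, w = image.shape
    out = np.empty((kernels.shape[0], h - kh + 1, w - kw + 1))
    for k, kern in enumerate(kernels):
        for i in range(h - kh + 1):
            for j in range(w - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kern)
    return out

def tiny_cnn_scores(image, kernels, weights):
    """Forward pass: conv -> ReLU -> global average pooling -> linear -> softmax."""
    feats = np.maximum(conv2d(image, kernels), 0.0)  # ReLU activation
    pooled = feats.mean(axis=(1, 2))                 # global average pooling
    logits = pooled @ weights                        # linear head over score classes
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

# Hypothetical 28x28 grayscale "graphical response" (random pixels as a stand-in)
image = rng.random((28, 28))
kernels = rng.standard_normal((4, 3, 3))  # 4 untrained 3x3 filters
weights = rng.standard_normal((4, 2))     # maps 4 pooled features to 2 score classes
probs = tiny_cnn_scores(image, kernels, weights)
print(probs)  # probability assigned to each scoring category
```

In practice the filters and head weights would be learned from human-scored training responses, and a trichotomous item would simply use a three-column head; the forward computation is otherwise the same.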