Information (Aug 2025)
Automated Grading Method of Python Code Submissions Using Large Language Models and Machine Learning
Abstract
Assessment is fundamental to programming education, yet it is a labour-intensive and complex process, especially in large-scale learning contexts where it relies heavily on human teachers. This paper presents an automated grading methodology for Python programming exercises that produces both continuous and discrete grades. The methodology combines GPT-4-Turbo, a powerful large language model, with machine learning models selected through PyCaret's automated model-comparison process. The Extra Trees Regressor performed best for continuous grade prediction, with a Mean Absolute Error (MAE) of 4.43 out of 100 and an R² score of 0.83. The Random Forest Classifier attained the highest scores for discrete grade classification, achieving an accuracy of 91% and a Quadratic Weighted Kappa (QWK) of 0.84, indicating substantial agreement with human-assigned grade categories. These findings underscore the promise of combining LLMs with automated model selection to enable scalable, consistent, and equitable assessment in programming education while substantially reducing the workload of human evaluators.
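To make the model-selection and evaluation steps named in the abstract concrete, the following minimal Python sketch shows how PyCaret's automated comparison could be run over a feature table of graded submissions, and how the reported QWK agreement metric is computed with scikit-learn. The dataset file name, feature columns, and target column are illustrative assumptions, not details taken from the paper.

    import pandas as pd
    from pycaret.regression import setup, compare_models
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical feature table: one row per submission, with numeric
    # features (e.g., LLM-derived rubric scores) and a human-assigned
    # grade on a 0-100 scale in a column named "grade".
    df = pd.read_csv("graded_submissions.csv")  # assumed file name

    # PyCaret cross-validates its library of regressors and ranks them;
    # the paper reports the Extra Trees Regressor winning this comparison
    # (MAE 4.43 out of 100, R² 0.83).
    setup(data=df, target="grade", session_id=42)
    best_regressor = compare_models(sort="MAE")
    print(best_regressor)

    # Quadratic Weighted Kappa, the agreement metric reported for the
    # discrete-grade classifier, penalises disagreements by their squared
    # ordinal distance. Toy labels shown here for illustration only.
    human = ["A", "B", "B", "C", "E"]
    model = ["A", "B", "C", "C", "D"]
    print(cohen_kappa_score(human, model, weights="quadratic"))

An analogous comparison with pycaret.classification would cover the discrete-grade case, where the paper reports the Random Forest Classifier as the top model.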
Keywords