Scientific Reports (Oct 2024)

A model for predicting academic performance on standardised tests for lagging regions based on machine learning and Shapley additive explanations

  • Mario Suaza-Medina,
  • Rita Peñabaena-Niebles,
  • Maria Jubiz-Diaz

DOI
https://doi.org/10.1038/s41598-024-76596-3
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 17

Abstract

Read online

Abstract Data are becoming more important in education since they allow for the analysis and prediction of future behaviour to improve academic performance and quality at educational institutions. However, academic performance is affected by regions’ conditions, such as demographic, psychographic, socioeconomic and behavioural variables, especially in lagging regions. This paper presents a methodology based on applying nine classification algorithms and Shapley values to identify the variables that influence the performance of the Colombian standardised test: the Saber 11 exam. This study is innovative because, unlike others, it applies to lagging regions and combines the use of EDM and Shapley values to predict students’ academic performance and analyse the influence of each variable on academic performance. The results show that the algorithms with the best accuracy are Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, and Gradient Boosting Machine. According to the Shapley values, the most influential variables are the socioeconomic level index, gender, region, location of the educational institution, and age. For Colombia, the results showed that male students from urban educational institutions over 18 years have the best academic performance. Moreover, there are differences in educational quality among the lagging regions. Students from Nariño have advantages over ones from other departments. The proposed methodology allows for generating public policies better aligned with the reality of lagging regions and achieving equity in access to education.

Keywords