Scientific Reports (Apr 2025)
A three-stage machine learning and inference approach for educational data
Abstract
A central task in educational studies is to uncover the factors that drive a student’s academic performance. Existing studies rely on carefully designed regressions, but selecting appropriate controls remains challenging. Machine learning offers an alternative in which the entire variable set can be inspected and filtered by different optimization schemes. In that light, this paper adopts a three-stage framework to discover and analyze potentially latent causal relationships in an open dataset from UCI. In the first stage, machine learning methods are employed to select candidate variables that are closely associated with student grades. In the second stage, a “post-double-selection” process is implemented to select the set of control variables. In the final stage, three case studies are conducted to illustrate the effectiveness of the three-stage design. The pipeline is suitable for situations in which only minimal prior knowledge is available to address a potentially causal research question.
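To make the “post-double-selection” step concrete, the following is a minimal sketch of the general technique, not the paper's own implementation. It assumes a lasso-based selector (scikit-learn's `LassoCV`) and a final ordinary least squares fit. The function name `post_double_selection`, the variable names, and the synthetic data are hypothetical and stand in for the UCI student data.

```python
import numpy as np
from sklearn.linear_model import LassoCV


def post_double_selection(y, d, X, cv=5, seed=0):
    """Estimate the effect of d on y, with controls chosen by two lasso fits.

    Hypothetical helper for illustration; y is the outcome (e.g. a grade),
    d is the variable of interest, X holds the candidate controls.
    """
    # Step 1: controls that help predict the outcome.
    sel_y = np.flatnonzero(LassoCV(cv=cv, random_state=seed).fit(X, y).coef_)
    # Step 2: controls that help predict the variable of interest.
    sel_d = np.flatnonzero(LassoCV(cv=cv, random_state=seed).fit(X, d).coef_)
    # Step 3: OLS of y on an intercept, d, and the union of both selections.
    controls = np.union1d(sel_y, sel_d)
    Z = np.column_stack([np.ones(len(y)), d, X[:, controls]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1], controls


# Synthetic data (an assumption for illustration): column 0 of X confounds
# both d and y, and the effect of d on y is set to 2.0 by construction.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 2.0 * d + 3.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

effect, controls = post_double_selection(y, d, X)
print("estimated effect:", effect)
print("selected control columns:", controls)
```

Taking the union of the two selected sets is the point of the design: a control that predicts the variable of interest stays in the final regression even if the outcome lasso alone would have dropped it.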
Keywords