Journal of Big Data (Nov 2024)

Data-driven prediction of soccer outcomes using enhanced machine and deep learning techniques

  • Ebenezer Fiifi Emire Atta Mills,
  • Zihui Deng,
  • Zhuoqing Zhong,
  • Jinger Li

DOI
https://doi.org/10.1186/s40537-024-01008-2
Journal volume & issue
Vol. 11, no. 1
pp. 1 – 37

Abstract

This paper introduces a novel framework for soccer game prediction using advanced machine learning and deep learning techniques, initially focusing on the Dutch Eredivisie League and later expanding to include the Scottish Premiership and the Belgian Jupiler Pro League. The methodology comprises data preprocessing, feature engineering, model training, and testing. The models evaluated include enhanced versions of Logistic Regression, XGBoost, Random Forest, SVM, Naive Bayes, a Feedforward Neural Network, and a Vanilla Recurrent Neural Network. Unlike existing studies that focus on end-of-game features, this research incorporates real-time features such as half-time results and goals to support in-game decision-making. Advanced data normalization and sampling methods, such as SVM-SMOTE and Near-Miss, are applied to improve model performance. Models are assessed using accuracy, recall, precision, F1-score, and area under the ROC curve. Results indicate that the Feedforward Neural Network excels at predicting game results, while Logistic Regression is best at predicting whether a game ends under or over 2.5 goals. A voting model that combines Random Forest and XGBoost consistently achieves the highest accuracy across both prediction tasks. Combining data from the three leagues further validates the models' robustness and generalizability. The study demonstrates the potential of machine and deep learning to improve soccer game predictions through advanced techniques and comprehensive data analysis, contributing to sports analytics.
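To make the over/under 2.5 goals task and the evaluation metrics named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the scores, the two probability lists (standing in for Random Forest and XGBoost outputs), the helper names, and the choice of soft voting (averaging probabilities) are all illustrative assumptions, since the abstract does not state how the voting model combines its members.

```python
from typing import Dict, List, Sequence, Tuple


def over_2_5(home_goals: int, away_goals: int) -> int:
    """Label a game 1 if its total goals exceed 2.5, else 0."""
    return int(home_goals + away_goals > 2.5)


def soft_vote(prob_a: Sequence[float], prob_b: Sequence[float],
              threshold: float = 0.5) -> List[int]:
    """Average two models' positive-class probabilities, then threshold.

    Soft voting is an assumption here; hard (majority) voting is another option.
    """
    return [int((a + b) / 2 >= threshold) for a, b in zip(prob_a, prob_b)]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical final scores (home, away); not data from the paper.
games: List[Tuple[int, int]] = [(2, 1), (0, 0), (1, 1), (3, 2), (1, 0), (2, 2)]
labels = [over_2_5(h, a) for h, a in games]

# Hypothetical positive-class probabilities from two separately trained models.
prob_rf = [0.8, 0.3, 0.6, 0.7, 0.2, 0.4]
prob_xgb = [0.6, 0.1, 0.5, 0.9, 0.4, 0.5]

preds = soft_vote(prob_rf, prob_xgb)
metrics = binary_metrics(labels, preds)
```

In practice the probabilities would come from fitted classifiers, and resampling methods such as SVM-SMOTE or Near-Miss would be applied to the training split only, so that the evaluation data keeps its original class balance.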

Keywords