Machine Learning-Based Identification of the Strongest Predictive Variables of Winning and Losing in Belgian Professional Soccer

Youri Geurkink; Jan Boone; Steven Verstockt; Jan G. Bourgois

doi:10.3390/app11052378

Applied Sciences (Mar 2021)

Affiliations

Youri Geurkink: Department of Movement and Sports Sciences, Ghent University, 9000 Ghent, Belgium
Jan Boone: Department of Movement and Sports Sciences, Ghent University, 9000 Ghent, Belgium
Steven Verstockt: Department of Electronics and Information Systems, Research Group IDLab, Ghent University—IMEC, 9052 Ghent, Belgium
Jan G. Bourgois: Department of Movement and Sports Sciences, Ghent University, 9000 Ghent, Belgium

DOI: https://doi.org/10.3390/app11052378
Journal volume & issue: Vol. 11, no. 5, p. 2378

Abstract

This study aimed to identify the strongest predictive variables of winning and losing in the highest Belgian soccer division. A predictive machine learning model based on a broad range of variables (n = 100) was constructed using a dataset of 576 games. To avoid multicollinearity and reduce dimensionality, a Variance Inflation Factor filter (threshold of 5) and BorutaShap were applied, respectively. A total of 13 variables remained and were used to predict winning or losing with Extreme Gradient Boosting. TreeExplainer was applied to determine feature importance at a global and local level. The model achieved an accuracy of 89.6% ± 3.1% (precision: 88.9%; recall: 90.1%; F1-score: 89.5%), correctly classifying 516 out of 576 games. Shots on target from the attacking penalty box proved to be the best predictor. Several physical indicators were among the best predictors, as were contextual variables such as Elo ratings, the summed transfer value of the benched players, and match location. The results show the added value of including a broad spectrum of variables when predicting and evaluating game outcomes. Clubs can use similar modelling approaches to identify the strongest predictive variables for their leagues, and to evaluate and improve their current quantitative analyses.
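
The multicollinearity step can be illustrated with a minimal Python sketch. The abstract states only that a Variance Inflation Factor threshold of 5 was used; the iterative "drop the variable with the highest VIF" rule, the helper names (`pearson`, `invert`, `vif`, `drop_collinear`), the column names, and the synthetic data below are all assumptions for illustration and are not taken from the study. The sketch relies on the identity that the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] -> [I | A^-1].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} for a dict mapping variable names to value lists."""
    names = list(columns)
    corr = [[1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
            for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def drop_collinear(columns, threshold=5.0):
    """Assumed procedure: repeatedly drop the variable with the highest VIF
    until every remaining VIF is at or below the threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept


# Synthetic, hypothetical match variables (not data from the study).
random.seed(0)
shots = [random.gauss(12, 4) for _ in range(200)]
demo = {
    "shots": shots,
    "shots_on_target": [0.4 * s + random.gauss(0, 0.3) for s in shots],
    "distance_km": [random.gauss(110, 5) for _ in range(200)],
}
kept = drop_collinear(demo)
print(sorted(kept))  # two variables remain; one of the two shot counts is dropped
```

In this toy setting the two shot variables are nearly linear functions of each other, so one of them is removed while the unrelated distance variable is kept. In practice, a library routine such as the VIF function in statsmodels would likely be preferable to a hand-rolled matrix inverse.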

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
