IEEE Access (Jan 2021)

Empirical Comparisons for Combining Balancing and Feature Selection Strategies for Characterizing Football Players Using FIFA Video Game System

  • Mustafa A. Al-Asadi,
  • Sakir Tasdemir

DOI
https://doi.org/10.1109/ACCESS.2021.3124931
Journal volume & issue
Vol. 9
pp. 149266 – 149286

Abstract

Modelling individual player performance using machine learning is a mature task in sports analytics. Two of the most significant challenges in machine learning are class imbalance and high dimensionality. We conducted a comprehensive literature review and observed that the two issues have been studied independently. We found that feature selection addresses high dimensionality by determining a subset of relevant features, while data sampling seeks to make the data more balanced by adding or removing instances. We also found that efforts have been made to study the effect of using feature selection and balancing techniques jointly. However, deciding whether to prioritize feature selection or sampling is still difficult, and the relationship between them remains unclear. This paper presents a large-scale comparison for characterizing football players into nine positions using FIFA video game data, whereas most previous studies in this field have characterized players into only three classes according to their positions. The proposed methodology consists of three main steps. In the first step, a sampling technique is applied to deal with class imbalance. The second step applies a feature selection technique to deal with high dimensionality. The third step combines feature selection and data sampling to deal with both issues. We based the comparisons on nine feature selection algorithms and three balancing techniques, and evaluated their performance using a random forest classifier.
We found that 1) feature selection techniques did not improve the accuracy of the baseline model, 2) balancing techniques improved accuracy compared to the baseline, and 3) the proposed methodology, which jointly applies resampling and feature selection on data balanced by random oversampling (ROS) and the synthetic minority oversampling technique (SMOTE), outperformed both the use of a single technique and training on the original imbalanced training set. Overall, the proposed methodology improved prediction accuracy compared to the baseline model. Moreover, the methodology significantly reduced the number of features, from 29 to 10 on average.
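
As an illustration of the balancing step, the following is a minimal sketch of random oversampling (ROS), assuming the usual definition in which rows of the smaller classes are duplicated at random until every class matches the largest one. It is not the authors' implementation: the function name `random_oversample`, the toy feature values, and the position labels are hypothetical, and in practice the resampling would be applied to the training data only.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)  # original rows are kept unchanged
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement from this class until it reaches the target
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Made-up toy data: two attributes per player, three unequal position classes
X = [[80, 30], [78, 35], [75, 33], [40, 85], [42, 80], [60, 60]]
y = ["ST", "ST", "ST", "CB", "CB", "CM"]

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Unlike ROS, SMOTE creates new synthetic minority instances by interpolating between neighbouring instances of the same class rather than duplicating existing rows, so the sketch above covers only the simpler of the two oversampling methods named in the abstract.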

Keywords