Kafkas Universitesi Veteriner Fakültesi Dergisi (Dec 2023)

Comparison of some balancing methods for classification of pacing horses using tree-based machine learning algorithms

  • Hülya ÖZEN,
  • Doğukan ÖZEN,
  • Banu YÜCEER ÖZKUL,
  • Ceyhan ÖZBEYAZ

DOI
https://doi.org/10.9775/kvfd.2023.30325
Journal volume & issue
Vol. 30, no. 1
pp. 31 – 39

Abstract

Classifiers in machine learning generally assume that observations are evenly distributed across classes. However, real-world data sets frequently exhibit skewed class distributions, a condition known as class imbalance, which causes classifiers to make highly biased predictions. Class balancing methods are one of several groups of methods that address the imbalanced data problem. We aimed to compare several class balancing methods for classifying pacing horses according to their origins. The data set contains morphological traits of horses and four origin classes with different sample sizes, which leads to a multi-class imbalanced data problem. The training data set was modified with different balancing methods. C5.0, Random Forest and Extreme Gradient Boosting Machine classifiers were trained on each balanced data set. The methods were compared using performance metrics computed on the original test set. The best prediction result in terms of both G-mean and Matthews Correlation Coefficient was obtained on the data set balanced with the random undersampling method; however, the best result according to F1 score was observed on the data set balanced with the Adaptive Synthetic Sampling Approach (ADASYN). The most important variables in the best models were body length, withers height, chest circumference and rump height. The Bulgarian origin was the most accurately predicted class despite having the smallest sample size. Class balancing methods clearly improved the performance of classifiers for predicting origins of pacing horses.
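
As an illustration of two ideas named in the abstract, the following minimal Python sketch shows what random undersampling of a multi-class training set and the G-mean and Matthews Correlation Coefficient metrics might look like. It is not the authors' implementation: the function names, the choice to shrink every class to the size of the smallest one, the use of the geometric mean of per-class recall as the multi-class G-mean, and the placeholder class labels are all assumptions made here for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (one common multi-class definition)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / support[c] for c in support]
    return math.prod(recalls) ** (1 / len(recalls))


def mcc(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient computed from label counts."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(v * v for v in true_counts.values())
    cov_pp = n * n - sum(v * v for v in pred_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Placeholder data: four hypothetical origin classes with unequal sizes.
rows = list(range(20))
labels = ["A"] * 10 + ["B"] * 5 + ["C"] * 3 + ["D"] * 2
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```

In line with the abstract, only the training data would be passed through `random_undersample`; the metrics would then be computed on predictions for the original, unmodified test set.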

Keywords