IEEE Access (Jan 2024)
When to Use Standardization and Normalization: Empirical Evidence From Machine Learning Models and XAI
Abstract
Optimizing machine learning (ML) model performance relies heavily on appropriate data preprocessing techniques. Despite the widespread use of standardization and normalization, empirical comparisons across different models, dataset sizes, and domains remain sparse. This study addresses this gap by evaluating five machine learning algorithms: Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost), on datasets of varying sizes from the business, health, and agriculture domains. Each model is assessed without scaling, with standardized data, and with normalized data. The comparative analysis reveals that standardization consistently improves the performance of linear models such as SVM and LR on large and medium datasets, whereas normalization enhances their performance on small datasets. Moreover, this study employs SHapley Additive exPlanations (SHAP) summary plots to examine how each feature contributes to model predictions and how this interpretability differs between unscaled and scaled datasets. The results yield practical guidelines for selecting scaling techniques based on dataset characteristics and algorithm compatibility. Finally, this investigation lays a foundation for data preprocessing and feature engineering across diverse models and domains, offering actionable insights for practitioners.
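As a minimal, illustrative sketch of the comparison the abstract describes (not the authors' code), the same classifier can be evaluated on unscaled, standardized, and normalized features; the scikit-learn pipeline, synthetic dataset, and choice of SVM below are assumptions for illustration only.

```python
# Hypothetical sketch: compare no scaling, standardization, and normalization
# for one classifier via cross-validation. Data and model are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

scalers = {
    "no scaling": None,
    "standardization": StandardScaler(),  # zero mean, unit variance
    "normalization": MinMaxScaler(),      # rescale each feature to [0, 1]
}

for name, scaler in scalers.items():
    model = SVC() if scaler is None else make_pipeline(scaler, SVC())
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:>15}: mean accuracy = {scores.mean():.3f}")
```

Wrapping the scaler and estimator in a pipeline ensures the scaler is fit only on each training fold, avoiding leakage into the validation folds.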
Keywords