Machine Learning Based Missing Data Imputation in Categorical Datasets

Muhammad Ishaq; Sana Zahir; Laila Iftikhar; Mohammad Farhad Bulbul; Seungmin Rho; Mi Young Lee

doi:10.1109/ACCESS.2024.3411817

IEEE Access (Jan 2024)

Affiliations

Muhammad Ishaq: Institute of Computer Sciences and Information Technology, The University of Agriculture at Peshawar, Peshawar, Khyber Pakhtunkhwa, Pakistan
Sana Zahir: Institute of Computer Sciences and Information Technology, The University of Agriculture at Peshawar, Peshawar, Khyber Pakhtunkhwa, Pakistan
Laila Iftikhar: Institute of Computer Sciences and Information Technology, The University of Agriculture at Peshawar, Peshawar, Khyber Pakhtunkhwa, Pakistan
Mohammad Farhad Bulbul: Department of Mathematics, Jashore University of Science and Technology, Jashore, Bangladesh
Seungmin Rho: Department of Industrial Security, Chung-Ang University, Seoul, South Korea
Mi Young Lee: Department of Research, Chung-Ang University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2024.3411817
Journal volume & issue: Vol. 12
pp. 88332 – 88344

Abstract

This research investigated the use of machine learning algorithms to predict and fill in missing values in categorical datasets. The emphasis was on ensemble models built with the Error-Correcting Output Codes (ECOC) framework, including SVM-based and KNN-based models as well as a hybrid classifier that combines SVM, KNN, and MLP models. Three diverse datasets (CPU, Hypothyroid, and Breast Cancer) were used to validate these algorithms. The results indicated that these techniques performed well in predicting and completing missing data, with effectiveness varying by dataset and missing-data pattern. Compared with single models, ensemble models built on the ECOC framework significantly improved prediction accuracy and robustness. Despite these encouraging results, deep learning for missing data imputation faces obstacles, including the need for large amounts of labeled data and the risk of overfitting. Future research should evaluate the feasibility and efficacy of deep learning algorithms for missing data imputation.
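
The general idea of classifier-based imputation with an ECOC ensemble can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe. It assumes scikit-learn's `OutputCodeClassifier` as a stand-in for the ECOC framework, treats the column with gaps as the prediction target, and one-hot encodes the remaining categorical columns. The helper name `impute_column`, the `"?"` missing-value marker, the `code_size` setting, and the toy rows are all hypothetical choices made for illustration. The hybrid SVM/KNN/MLP classifier is not shown.

```python
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC


def impute_column(X, col, base_estimator, missing="?"):
    """Fill `missing` entries in column `col` of a categorical array.

    Hypothetical helper: rows where the column is observed train an
    ECOC ensemble, which then predicts the value for the other rows.
    """
    X = np.asarray(X, dtype=object)
    is_missing = X[:, col] == missing
    filled = X.copy()
    if not is_missing.any():
        return filled

    # One-hot encode every other column. A missing marker in those
    # columns is simply treated as one more category (an assumption).
    features = np.delete(X, col, axis=1)
    encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(features)
    encoded = encoded.toarray()

    # ECOC wrapper: each class gets a binary code word and one copy of
    # the base estimator is trained per code bit.
    model = OutputCodeClassifier(base_estimator, code_size=2.0, random_state=0)
    model.fit(encoded[~is_missing], X[~is_missing, col].astype(str))

    filled[is_missing, col] = model.predict(encoded[is_missing])
    return filled


# Toy data (invented): the last column has two gaps marked with "?".
rows = np.array([
    ["low", "yes", "A"],
    ["low", "no", "A"],
    ["high", "yes", "B"],
    ["high", "no", "B"],
    ["mid", "yes", "C"],
    ["mid", "no", "C"],
    ["low", "yes", "A"],
    ["high", "no", "B"],
    ["mid", "yes", "C"],
    ["low", "no", "?"],
    ["high", "yes", "?"],
], dtype=object)

filled_svm = impute_column(rows, col=2, base_estimator=SVC(kernel="linear"))
filled_knn = impute_column(
    rows, col=2, base_estimator=KNeighborsClassifier(n_neighbors=3)
)
print(filled_svm[-2:, 2], filled_knn[-2:, 2])
```

Swapping the `base_estimator` argument is what distinguishes the SVM-based from the KNN-based ensemble in this sketch. Evaluating such a model would additionally require masking known values and comparing the predictions against them, which is omitted here.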

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
