An Improved Count-Based Classifier for Categorical Data

Sanskriti Sanjay Kumar Singh; Alok Chauhan

doi:10.1109/ACCESS.2024.3454770

IEEE Access (Jan 2024)

Affiliations

Sanskriti Sanjay Kumar Singh: ORCiD; School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India
Alok Chauhan: ORCiD; School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India

DOI: https://doi.org/10.1109/ACCESS.2024.3454770
Journal volume & issue: Vol. 12
pp. 125427 – 125445

Abstract

The classification of categorical data is a fundamental task in machine learning, with numerous algorithms and techniques available. However, existing approaches often face challenges related to interpretability, scalability, and handling sparse or imbalanced datasets. This study presents an optimized version of the Count-Based Classifier, an approach that uses simple counts of occurrences to classify categorical data. The optimized algorithm addresses the limitations of the original Count-Based Classifier, improving its computational efficiency, robustness, and overall performance. In a comprehensive evaluation across 21 diverse categorical datasets, the Optimized Count-Based Classifier demonstrates competitive performance, consistently matching or surpassing established classifiers such as Decision Trees and Support Vector Machines. The classifier’s inherent interpretability, stemming from its reliance on counting operations, is a valuable asset, particularly in domains where transparency and explainability are crucial. Furthermore, the study explores the classifier’s characteristics, including its tendency for overfitting, result consistency, and robustness against label errors. Experimental analyses reveal a low propensity for overfitting, high result consistency, and strong resilience to mislabeled data, supporting the classifier’s practical applicability. The Optimized Count-Based Classifier has been implemented in Python and released as a user-friendly package to encourage adoption within the machine learning community. By addressing the limitations of traditional approaches and offering a simple yet effective solution, this work contributes to the advancement of count-based classification techniques and their application in real-world scenarios.
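
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what classification by counting occurrences can look like in general. It is not the authors' Optimized Count-Based Classifier and not the API of their package: the class name `SimpleCountClassifier`, its methods, the class-size normalization, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class SimpleCountClassifier:
    """Hypothetical illustration of counting-based classification.

    NOT the algorithm from the paper; the scoring rule below is an assumption.
    """

    def fit(self, rows, labels):
        # counts[(feature_index, value)][label] = number of training rows
        # with that value in that feature and that class label.
        self.counts = defaultdict(Counter)
        self.class_totals = Counter(labels)
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[(i, value)][label] += 1
        return self

    def predict_one(self, row):
        scores = Counter()
        for i, value in enumerate(row):
            # .get avoids creating empty entries for unseen values.
            for label, n in self.counts.get((i, value), {}).items():
                # Assumed normalization: divide by class size so that
                # larger classes do not dominate purely by volume.
                scores[label] += n / self.class_totals[label]
        if not scores:
            # No feature value was seen in training: fall back to the
            # most frequent training class.
            return self.class_totals.most_common(1)[0][0]
        return max(scores, key=scores.get)

    def predict(self, rows):
        return [self.predict_one(row) for row in rows]


# Toy categorical data (made up for this sketch).
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

clf = SimpleCountClassifier().fit(rows, labels)
predictions = clf.predict([("sunny", "hot"), ("rainy", "cool")])
print(predictions)
```

Because every prediction is a sum of per-feature counts, the contribution of each feature value to a decision can be read directly from `clf.counts`, which is the kind of transparency the abstract attributes to counting-based methods.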

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
