IEEE Access (Jan 2024)
An Improved Count-Based Classifier for Categorical Data
Abstract
The classification of categorical data is a fundamental task in machine learning, with numerous algorithms and techniques available. However, existing approaches often face challenges related to interpretability, scalability, and the handling of sparse or imbalanced datasets. This study presents an optimized version of the Count-Based Classifier, a novel approach that performs classification on categorical data by counting occurrences. The optimized algorithm addresses the limitations of the original Count-Based Classifier, improving its computational efficiency, robustness, and overall performance. In a comprehensive evaluation across 21 diverse categorical datasets, the Optimized Count-Based Classifier demonstrates competitive performance, consistently matching or surpassing established classifiers, including Decision Trees and Support Vector Machines. The classifier’s inherent interpretability, which stems from its reliance on counting operations, is a valuable asset, particularly in domains where transparency and explainability are crucial. The study also examines the classifier’s characteristics: its tendency to overfit, the consistency of its results, and its robustness against label errors. Experimental analyses reveal a low propensity for overfitting, high result consistency, and strong resilience to mislabeled data, which supports the classifier’s practical applicability. The Optimized Count-Based Classifier has been implemented in Python and released as a user-friendly package to encourage adoption within the machine learning community. By addressing the limitations of traditional approaches and offering a simple yet effective solution, this work contributes to the advancement of count-based classification techniques and their application in real-world scenarios.
Keywords