Data Science and Management (Jun 2025)

Surprisal-based algorithm for detecting anomalies in categorical data

  • Ossama Cherkaoui,
  • Houda Anoun,
  • Abderrahim Maizate

Journal volume & issue
Vol. 8, no. 2
pp. 185 – 195

Abstract

Anomaly detection is an important research area with a diverse range of real-world applications. Although many algorithms have been proposed for numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because they lack a natural distance metric. Consequently, the methods proposed in the literature rest on entirely different assumptions about what constitutes a categorical anomaly. This paper presents a novel categorical anomaly detection approach that makes two key contributions over existing methods. First, it introduces a novel surprisal-based anomaly score, which provides a more accurate assessment of anomalies by considering the full distribution of categorical values. Second, the proposed method considers complex correlations in the data beyond pairwise feature interactions. The resulting algorithm, categorical surprisal anomaly detection (CSAD), is evaluated against six competitors. The experimental results indicate that CSAD delivers the best overall performance, achieving the highest average ROC-AUC and PR-AUC values of 0.8 and 0.443, respectively. Furthermore, CSAD's execution time remains satisfactory even on large, high-dimensional datasets.
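
The abstract does not spell out the CSAD scoring formula, so the following is only a minimal sketch of the general idea of surprisal, not the authors' algorithm. It assumes the standard information-theoretic definition, where the surprisal of a value with probability p is -log2(p), estimates p from empirical frequencies, and treats features as independent (a simplification that CSAD explicitly goes beyond). The function name `surprisal_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(rows):
    """Return one score per row: the sum of per-feature surprisals.

    Assumptions (not taken from the paper): probabilities are plain
    empirical frequencies and features are treated as independent.
    """
    n = len(rows)
    n_features = len(rows[0])
    # Count how often each value occurs, separately for every column.
    counts = [Counter(row[j] for row in rows) for j in range(n_features)]
    return [
        sum(-math.log2(counts[j][row[j]] / n) for j in range(n_features))
        for row in rows
    ]


# Toy categorical dataset: the last row uses rare values in both columns.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
scores = surprisal_scores(rows)
most_anomalous = max(range(len(rows)), key=lambda i: scores[i])
```

In this toy example the last row receives the highest score because both of its values are rare. Capturing joint behavior across features, as the abstract describes, would require modeling value combinations rather than summing independent per-column terms.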

Keywords