IEEE Access (Jan 2023)

Quality Anomaly Detection Using Predictive Techniques: An Extensive Big Data Quality Framework for Reliable Data Analysis

  • Elouataoui Widad,
  • Elmendili Saida,
  • Youssef Gahi

DOI
https://doi.org/10.1109/ACCESS.2023.3317354
Journal volume & issue
Vol. 11
pp. 103306 – 103318

Abstract

The increasing reliance on Big Data analytics has highlighted the critical role of data quality in ensuring accurate and reliable results. Consequently, organizations aiming to leverage the power of Big Data recognize data quality as an integral component. One notable type of data quality anomaly observed in big datasets is the presence of outlier values. Detecting and addressing these outliers has become a subject of interest across diverse domains, leading to the development of numerous anomaly detection approaches. Although anomaly detection has witnessed a proliferation of practices in recent years, a significant gap remains in addressing anomalies related to other aspects of data quality. Indeed, while most approaches focus on identifying anomalies that deviate from expected patterns, they do not consider irregularities in data quality, such as missing, incorrect, or inconsistent data. Moreover, most approaches are domain-specific and lack the capability to detect anomalies in a generic manner. Thus, this paper aims to address this gap in the field and provide a holistic and effective solution for Big Data quality anomaly detection. To achieve this, we propose a novel approach that allows comprehensive detection of Big Data quality anomalies related to six quality dimensions: Accuracy, Consistency, Completeness, Conformity, Uniqueness, and Readability. Moreover, the framework allows for sophisticated detection of generic data quality anomalies through an intelligent anomaly detection model that is not tied to a specific field. Furthermore, we introduce and measure a new metric called “Quality Anomaly Score,” which quantifies the degree of anomalousness of the quality anomalies for each quality dimension and for the entire dataset. In our implementation and evaluation, the proposed framework achieved an accuracy score of up to 99.91% and an F1-score of 98.07%.

Keywords