IEEE Access (Jan 2023)
PAACDA: Comprehensive Data Corruption Detection Algorithm
Abstract
With the advent of technology, data and its analysis are no longer just values and attributes strewn across spreadsheets; they are now seen as a stepping stone to revolution in any significant field. Data corruption can be brought about by a variety of unethical and illegal sources, making it crucial to develop a highly effective method to identify and appropriately highlight the corrupted data in a dataset. Detecting corrupted data, as well as recovering data from a corrupted dataset, is a challenging problem. It is of utmost importance and, if not addressed at an early stage, may pose problems in later stages of data processing with machine learning or deep learning algorithms. In this work we introduce PAACDA, the Proximity based Adamic Adar Corruption Detection Algorithm, and consolidate the results, with particular emphasis on the detection of corrupted data rather than outliers. Current state-of-the-art models, such as Isolation Forest and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), rely on fine-tuned parameters to provide high accuracy and recall, but they also show a significant level of uncertainty when dealing with corrupted data. We examine the niche performance issues of several unsupervised learning algorithms on linear and clustered corrupted datasets. We also propose the novel PAACDA algorithm, which outperforms 15 popular unsupervised learning baselines, including K-means clustering, Isolation Forest and LOF (Local Outlier Factor), with an accuracy of 96.35% for clustered data and 99.04% for linear data. This article also conducts a thorough exploration of the relevant literature from the previously stated perspectives. Finally, we pinpoint the shortcomings of the present techniques and draw directions for future work in this field.
Keywords