PAACDA: Comprehensive Data Corruption Detection Algorithm

Charvi Bannur; Chaitra Bhat; Kushagra Singh; Shrirang Ambaji Kulkarni; Mrityunjay Doddamani

doi:10.1109/ACCESS.2023.3253022

IEEE Access (Jan 2023)

Affiliations

Charvi Bannur: Department of Computer Science and Engineering, People’s Education Society, Bengaluru, India
Chaitra Bhat: Department of Computer Science and Engineering, People’s Education Society, Bengaluru, India
Kushagra Singh: Department of Computer Science and Engineering, People’s Education Society, Bengaluru, India
Shrirang Ambaji Kulkarni: Department of Computer Science and Engineering, National Institute of Engineering, Mysore, India
Mrityunjay Doddamani: School of Mechanical and Materials Engineering, Indian Institute of Technology Mandi, Mandi, Himachal Pradesh, India

DOI: https://doi.org/10.1109/ACCESS.2023.3253022
Journal volume & issue: Vol. 11
pp. 24908 – 24934

Abstract

With the advance of technology, data and its analysis are no longer just values and attributes strewn across spreadsheets; they are now seen as a stepping stone to transforming any significant field. Data corruption can arise from a variety of unethical and illegal sources, so it is crucial to develop a method that effectively identifies and highlights the corrupted data in a dataset. Detecting corrupted data, and recovering data from a corrupted dataset, is a challenging problem. It deserves close attention: if not addressed early, it can cause problems in later stages of data processing with machine learning or deep learning algorithms. In this work, we introduce PAACDA, the Proximity based Adamic Adar Corruption Detection Algorithm, and consolidate the results with particular emphasis on detecting corrupted data rather than outliers. Current state-of-the-art models, such as Isolation Forest and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), rely on fine-tuned parameters to achieve high accuracy and recall, and they also show significant uncertainty when dealing with corrupted data. We examine the performance issues of several unsupervised learning algorithms on linear and clustered corrupted datasets. The proposed PAACDA algorithm outperforms 15 popular unsupervised baselines, including K-means clustering, Isolation Forest, and LOF (Local Outlier Factor), with an accuracy of 96.35% for clustered data and 99.04% for linear data. This article also conducts a thorough review of the relevant literature from these perspectives. Finally, we identify the shortcomings of present techniques and outline directions for future work in this field.
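
As background for readers unfamiliar with the name: the Adamic-Adar index is a standard graph proximity measure that scores a pair of nodes by summing 1/log(degree) over their common neighbors. The abstract does not say how PAACDA adapts this measure to corruption detection, so the following is only a minimal sketch of the standard index on a hypothetical toy graph, not the authors' algorithm. The function name `adamic_adar` and the graph `toy_graph` are illustrative assumptions.

```python
import math


def adamic_adar(graph, u, v):
    """Standard Adamic-Adar index of two distinct nodes u and v.

    `graph` is an undirected adjacency map: node -> set of neighbors.
    This is the textbook index only; it is NOT the PAACDA algorithm.
    """
    common = graph[u] & graph[v]
    # For distinct u and v in an undirected graph, every common neighbor
    # touches both of them, so its degree is at least 2 and log(degree) > 0.
    return sum(1.0 / math.log(len(graph[z])) for z in common)


# Hypothetical toy graph (illustrative data, not taken from the paper).
toy_graph = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b", "e"},
    "d": {"a", "b"},
    "e": {"c"},
}

# "a" and "b" share neighbors "c" (degree 3) and "d" (degree 2).
score_ab = adamic_adar(toy_graph, "a", "b")
print(f"Adamic-Adar(a, b) = {score_ab:.4f}")
```

Common neighbors with few connections contribute more to the score than highly connected ones, which is why the index is often read as a proximity measure.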

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
