An efficient learning based approach for automatic record deduplication with benchmark datasets

M Ravikanth; Sampath Korra; Gowtham Mamidisetti; Maganti Goutham; T. Bhaskar

doi:10.1038/s41598-024-63242-1

Scientific Reports (Jul 2024)

Affiliations

M Ravikanth: Department of CSE, Malla Reddy University
Sampath Korra: Department of CSE, Sri Indu College of Engineering and Technology (A)
Gowtham Mamidisetti: Department of CSE, Malla Reddy University
Maganti Goutham: Department of CSE, Malla Reddy University
T. Bhaskar: Department of CSE, CMR College of Engineering and Technology

DOI: https://doi.org/10.1038/s41598-024-63242-1
Journal volume & issue: Vol. 14, no. 1
pp. 1 – 19

Abstract

With technological innovation, enterprises manage every iota of their data because it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, duplicate records may result. Because data is given paramount importance, eliminating duplicate entities is also important for data integration, performance, and resource optimization. Of late, deep learning has offered promising means of building reliable record deduplication systems with a learning-based approach. Deep ER is a recent deep learning-based method for eliminating duplicates in structured data. Using it as a reference model, in this paper we propose a framework known as Enhanced Deep Learning-based Record Deduplication (EDL-RD) to improve performance further. To this end, we exploit a variant of Long Short-Term Memory (LSTM) along with various attribute compositions, similarity metrics, and numerical and null value resolution. We also propose an algorithm known as Efficient Learning based Record Deduplication (ELbRD), which extends the reference model with these enhancements. An empirical study shows that the proposed framework with these extensions outperforms existing methods.

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
