Automated Rule-Based Data Cleaning Using NLP

Konstantinos Mavrogiorgos; Argyro Mavrogiorgou; Athanasios Kiourtis; Nikolaos Zafeiropoulos; Spyridon Kleftakis; Dimosthenis Kyriazis

doi:10.23919/FRUCT56874.2022.9953810

Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2022)

Affiliations

Konstantinos Mavrogiorgos: University of Piraeus, Greece
Argyro Mavrogiorgou: University of Piraeus, Greece
Athanasios Kiourtis: University of Piraeus, Greece
Nikolaos Zafeiropoulos: University of Piraeus, Greece
Spyridon Kleftakis: University of Piraeus, Greece
Dimosthenis Kyriazis: University of Piraeus, Greece

DOI: https://doi.org/10.23919/FRUCT56874.2022.9953810
Journal volume & issue: Vol. 32, no. 1
pp. 162 – 168

Abstract

Data Cleaning is a subfield of Data Mining that has been thriving in recent years. Ensuring the reliability of data, whether at generation or on receipt, is vital for providing the best possible services to users. This is easier said than done, since data are complex, generated at an extremely high rate, and enormous in size. A variety of techniques and methods from other subfields of Computer Science have been employed to make Data Cleaning as efficient and effective as possible. These subfields include, among others, Natural Language Processing (NLP), which in essence concerns the interaction between computers and human language and seeks ways to program computers to process and analyze huge volumes of human language data. NLP has existed for a long time, but it is increasingly being proposed for a variety of tasks that are not solely NLP-related. This paper proposes a rule-based data cleaning mechanism that utilizes NLP to ensure data reliability. Using NLP enabled the mechanism to be not only extremely effective but also considerably more efficient than corresponding mechanisms that do not utilize NLP. The mechanism was evaluated on diverse healthcare datasets; it is not, however, limited to the healthcare domain and supports a generalized data cleaning concept.
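
The abstract does not say how NLP is combined with the cleaning rules, so the following is only a minimal illustrative sketch of the general idea and not the authors' mechanism. It assumes that a rule is selected for each column by comparing tokens of the column name with keywords attached to the rule, a simple lexical stand-in for NLP, and that values failing the selected rule are replaced with None. The rule catalogue RULES, the helper names tokenize, match_rule and clean, and the sample column names are all hypothetical.

```python
import re

# Hypothetical rule catalogue (an assumption, not taken from the paper):
# each rule has descriptive keywords and a validator for string values.
RULES = {
    "age": {
        "keywords": {"age", "years", "old"},
        "valid": lambda v: v.isdecimal() and 0 <= int(v) <= 120,
    },
    "email": {
        "keywords": {"email", "mail", "address"},
        "valid": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    },
}


def tokenize(column_name):
    """Split a camelCase or snake_case column name into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", column_name)
    return {tok.lower() for tok in re.split(r"[^A-Za-z]+", spaced) if tok}


def match_rule(column_name):
    """Return the name of the rule whose keywords best overlap the column
    tokens (Jaccard similarity), or None if no rule overlaps at all."""
    tokens = tokenize(column_name)
    best_rule, best_score = None, 0.0
    for rule_name, rule in RULES.items():
        union = tokens | rule["keywords"]
        score = len(tokens & rule["keywords"]) / len(union) if union else 0.0
        if score > best_score:
            best_rule, best_score = rule_name, score
    return best_rule


def clean(rows):
    """Trim every value; replace values that violate their matched rule
    with None. Columns without a matching rule are only trimmed."""
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            text = str(value).strip()
            rule_name = match_rule(column)
            if rule_name is not None and not RULES[rule_name]["valid"](text):
                new_row[column] = None
            else:
                new_row[column] = text
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    sample = [{"patientAge": " 34 ", "contact_email": "not-an-email", "name": "Ann"}]
    print(clean(sample))
```

In this sketch a column such as patientAge is routed to the age rule without any hand-written column-to-rule mapping, which is the kind of generalization across datasets the abstract alludes to. A real system would likely replace the token overlap with a stronger language model.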

Published in Proceedings of the XXth Conference of Open Innovations Association FRUCT

ISSN: 2305-7254 (Print); 2343-0737 (Online)
Publisher: FRUCT
Country of publisher: Finland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication
Website: http://fruct.org/publication
