IEEE Access (Jan 2022)

Cleaning Data With Selection Rules

  • Toon Boeckling,
  • Guy De Tre,
  • Antoon Bronselaer

DOI
https://doi.org/10.1109/ACCESS.2022.3222786
Journal volume & issue
Vol. 10
pp. 125212 – 125229

Abstract

In this paper, we propose and study a type of tuple-level constraint that arises from the selection operator $\sigma$ of relational algebra and that closely resembles tuple-level denial constraints. We call constraints of this type selection rules and study their concepts and properties in the setting of data consistency management. The main contribution of this paper is the study of rule implication with selection rules in order to solve the error localization problem by means of the set cover method. It turns out that rule implication can be applied more easily if the representation of selection rules is extended to allow gaps between attribute values. We show that the properties of selection rules make it possible to improve the performance of rule implication. An evaluation of our approach against HoloClean on four real-world datasets shows promising results. First, repair with selection rules is often faster and less memory-intensive than HoloClean, especially when the amount of work that rule implication has to do is limited. Second, in terms of precision and recall of error detection and correction, repair strategies with selection rules almost always outperform HoloClean.
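
To make the idea of error localization by set cover more concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes that a rule flags a tuple as inconsistent when its selection condition holds (as with a denial constraint), and it uses a plain greedy heuristic to pick attributes that cover all violated rules. The rule encoding, the names `violated` and `greedy_cover`, and the example data are all hypothetical. The sketch also omits rule implication, which the abstract identifies as necessary for the set cover method to localize errors correctly.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, object]
# Hypothetical encoding: (attributes the rule involves, selection condition).
# A row is treated as inconsistent with a rule when the condition is True.
Rule = Tuple[Set[str], Callable[[Row], bool]]


def violated(row: Row, rules: List[Rule]) -> List[Set[str]]:
    """Return the attribute sets of all rules whose condition selects the row."""
    return [attrs for attrs, condition in rules if condition(row)]


def greedy_cover(violations: List[Set[str]]) -> Set[str]:
    """Greedily choose attributes until every violated rule contains one of them.

    This is a heuristic: it yields a cover, not necessarily a minimum one.
    """
    uncovered = [set(v) for v in violations]
    chosen: Set[str] = set()
    while uncovered:
        # Count in how many uncovered rules each attribute appears.
        counts: Dict[str, int] = {}
        for attrs in uncovered:
            for attr in attrs:
                counts[attr] = counts.get(attr, 0) + 1
        # Pick the most frequent attribute; ties are broken alphabetically.
        best = max(sorted(counts), key=counts.get)
        chosen.add(best)
        uncovered = [attrs for attrs in uncovered if best not in attrs]
    return chosen


# Illustrative rules and data (assumptions, not taken from the paper).
rules: List[Rule] = [
    ({"age", "marital_status"},
     lambda r: r["age"] < 18 and r["marital_status"] == "married"),
    ({"age", "employment"},
     lambda r: r["age"] < 18 and r["employment"] == "retired"),
]
row: Row = {"age": 12, "marital_status": "married", "employment": "retired"}

suspects = greedy_cover(violated(row, rules))
print(suspects)
```

In this toy example both rules are violated and both involve `age`, so changing that single attribute is enough to cover them.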

Keywords