Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2024)

Extending Desbordante with Probabilistic Functional Dependency Discovery Support

  • Ilia Barutkin,
  • Maxim Alex Fofanov,
  • Sergey A Belokonny,
  • Vladislav Makeev,
  • George Chernishev

DOI
https://doi.org/10.23919/FRUCT61870.2024.10516409
Journal volume & issue
Vol. 35, no. 1
pp. 169 – https://youtu.be/z60b1-gnWTA

Abstract

Read online

Data profiling aims to extract complex patterns from data for further analysis, and use that data in domains such as data cleaning, data deduplication, anomaly detection, and many more. Functional dependencies (FDs) are one of the most well-known patterns. However, they are poorly suited for these tasks, as real data is usually dirty, and the rigid definition of FDs does not allow algorithms to locate them. For this reason, there are several formulations aimed at relaxing FDs to support dirty data, with approximate functional dependency (AFD) being the most popular one. Another formulation is the Probabilistic Functional Dependency (PFD), which we aim to support inside Desbordante~--- a science-intensive, high-performance and open-source data profiling tool implemented in C++. However, PFDs are relatively poorly studied, compared to AFDs. In this paper we study PFDs, both analytically and empirically. We start by assessing how different PFDs and AFDs are by demonstrating cases in which PFDs have an edge over AFDs. Then, we implement the algorithm for PFD discovery, as well as study its run time and memory consumption. We also compare it with an AFD discovery algorithm. Lastly, we study the output of both algorithms to learn whether or not it is possible to use AFD discovery algorithm to get PFDs and vice versa.

Keywords