Extending Desbordante with Probabilistic Functional Dependency Discovery Support

Ilia Barutkin; Maxim Alex Fofanov; Sergey A Belokonny; Vladislav Makeev; George Chernishev

doi:10.23919/FRUCT61870.2024.10516409

Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2024)

Extending Desbordante with Probabilistic Functional Dependency Discovery Support

Ilia Barutkin,
Maxim Alex Fofanov,
Sergey A Belokonny,
Vladislav Makeev,
George Chernishev

Affiliations

Ilia Barutkin: Saint-Petersburg University
Maxim Alex Fofanov: Saint Petersburg University
Sergey A Belokonny: St Petersburg University
Vladislav Makeev: Saint-Petersburg University
George Chernishev: Saint-Petersburg University

DOI: https://doi.org/10.23919/FRUCT61870.2024.10516409
Journal volume & issue: Vol. 35, no. 1
pp. 169 – https://youtu.be/z60b1-gnWTA

Abstract

Read online

Data profiling aims to extract complex patterns from data for further analysis, and use that data in domains such as data cleaning, data deduplication, anomaly detection, and many more. Functional dependencies (FDs) are one of the most well-known patterns. However, they are poorly suited for these tasks, as real data is usually dirty, and the rigid definition of FDs does not allow algorithms to locate them. For this reason, there are several formulations aimed at relaxing FDs to support dirty data, with approximate functional dependency (AFD) being the most popular one. Another formulation is the Probabilistic Functional Dependency (PFD), which we aim to support inside Desbordante~--- a science-intensive, high-performance and open-source data profiling tool implemented in C++. However, PFDs are relatively poorly studied, compared to AFDs. In this paper we study PFDs, both analytically and empirically. We start by assessing how different PFDs and AFDs are by demonstrating cases in which PFDs have an edge over AFDs. Then, we implement the algorithm for PFD discovery, as well as study its run time and memory consumption. We also compare it with an AFD discovery algorithm. Lastly, we study the output of both algorithms to learn whether or not it is possible to use AFD discovery algorithm to get PFDs and vice versa.

Published in Proceedings of the XXth Conference of Open Innovations Association FRUCT

ISSN: 2305-7254 (Print); 2343-0737 (Online)
Publisher: FRUCT
Country of publisher: Finland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication
Website: http://fruct.org/publication

About the journal

Abstract

Keywords