Fast Discovery of Inclusion Dependencies with Desbordante

Alexander A Smirnov; Anton Chizhov; Ilya Shchuckin; Nikita Bobrov; George Chernishev

doi:10.23919/FRUCT58615.2023.10143047

Proceedings of the XXth Conference of Open Innovations Association FRUCT (May 2023)

Fast Discovery of Inclusion Dependencies with Desbordante

Alexander A Smirnov,
Anton Chizhov,
Ilya Shchuckin,
Nikita Bobrov,
George Chernishev

Affiliations

Alexander A Smirnov: Saint-Petersburg State University
Anton Chizhov: Saint-Petersburg University
Ilya Shchuckin: Saint-Petersburg University
Nikita Bobrov: Saint Petersburg State University
George Chernishev: Saint-Petersburg State University

DOI: https://doi.org/10.23919/FRUCT58615.2023.10143047
Journal volume & issue: Vol. 33, no. 1
pp. 264 – 275

Abstract

Read online

Inclusion dependency is a relation between attributes of tables that indicates possible Primary Key--Foreign Key references. Automatic discovery of inclusion dependencies is a relevant problem for both academic and industrial communities. The core concern for this problem is the efficiency of discovery process since it is a computationally expensive task. However, existing studies only address the algorithmic side, while leaving out the implementation aspect. At the same time, engineering details are at least as important as the algorithmic ones for achieving good performance. In this paper, we describe techniques for efficient implementation of two algorithms for discovery of inclusion dependencies~--- Spider and Faida. The first one is a classic algorithm whose ideas lie in the foundation of many other inclusion dependency discovery algorithms. We propose an efficient parallelization technique which greatly speeds up the algorithm while simultaneously reducing its memory consumption. The second one is the state-of-the-art approximate algorithm, which we approach by applying four types of optimizations: data buffering, SIMD-enabled execution, careful hash-table selection and parallelization. In order to experimentally evaluate our techniques, we have implemented these algorithms in Desbordante~--- an open-source science-intensive data profiler written in C++. For Spider, we have evaluated several different options, and in case of Faida we have demonstrated that all our optimization techniques yield results. We also compared our implementations with Metanome~--- a Java-based data profiler. Overall, we report up to 5x improvement in terms of run time reduction for Spider and up to 8x for Faida.

inclusion dependency database constraints data profiling knowledge discovery algorithm design

Published in Proceedings of the XXth Conference of Open Innovations Association FRUCT

ISSN: 2305-7254 (Print); 2343-0737 (Online)
Publisher: FRUCT
Country of publisher: Finland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication
Website: http://fruct.org/publication

About the journal

Abstract

Keywords