Applied Sciences (Dec 2022)

Modeling the Data Provenance of Relational Databases Supporting Full-Featured SQL and Procedural Languages

  • Deyou Tang,
  • Rong Zhao,
  • Yuebang Lin,
  • Tangqing Zhang,
  • Pingjian Zhang

DOI
https://doi.org/10.3390/app13010064
Journal volume & issue
Vol. 13, no. 1
p. 64

Abstract

Data provenance is information about where data come from (provenance data) and how they are transformed (provenance transformation). It is widely used to evaluate data quality, trace errors, audit data, and understand references among data. Current studies on data provenance in relational database management systems (RDBMSs) remain limited in their support for full-featured SQL or procedural languages. To address these limitations, we present a formal definition of provenance data and provenance transformation for relational data. We then propose a solution for supporting data provenance in relational databases, including provenance graphs and provenance routes. Our method not only addresses the complicated problem of modeling provenance in a DBMS but can also be extended to procedural languages in SQL. We also present ProvPg, a PostgreSQL-based prototype database system that supports data provenance at multiple granularities. ProvPg implements extraction, calculation, query, and visualization of provenance. We run TPC-H tests on both ProvPg and PostgreSQL. Experimental results show that ProvPg supports data provenance with little extra computational overhead for the execution engine, which indicates that our model is applicable to lineage tracing applications.

Keywords