Visual Informatics (Mar 2024)
An open dataset of data lineage graphs for data governance research
Abstract
Data have become valuable assets for enterprises. Data governance aims to manage and reuse data assets, facilitating enterprise management and enabling product innovation. In data governance, a data lineage graph (DLG) is an abstracted collection of data assets and the data lineages among them. Analyzing DLGs can provide rich insights for data governance. However, progress in data governance technologies is hindered by a shortage of open DLG datasets. This paper introduces an open dataset of DLGs and describes the DLG model, the dataset construction process, and application areas. This real-world dataset, sourced from Huawei Cloud Computing Technology Company Limited, contains 18 DLGs with three types of data assets and two types of relations. To the best of our knowledge, it is the first open dataset of DLGs for data governance. The dataset can also support work in other application areas, such as graph analytics and visualization.
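As a rough illustration of the kind of structure the abstract describes, the following minimal Python sketch models a DLG as a directed graph whose nodes carry an asset type and whose edges carry a relation type. It is not the paper's DLG model or the dataset's file format: the class name DataLineageGraph, its methods, and the labels "type_A", "type_B", "lineage", and "other_relation" are all hypothetical placeholders, since the abstract does not name the three asset types or the two relation types.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataLineageGraph:
    """Hypothetical typed directed graph: nodes are data assets, edges are relations."""
    asset_types: dict = field(default_factory=dict)  # asset id -> asset type label
    edges: list = field(default_factory=list)        # (source, target, relation label)

    def add_asset(self, asset_id, asset_type):
        self.asset_types[asset_id] = asset_type

    def add_relation(self, source, target, relation_type):
        # Both endpoints must already be registered as assets.
        if source not in self.asset_types or target not in self.asset_types:
            raise KeyError("both endpoints must be added as assets first")
        self.edges.append((source, target, relation_type))

    def downstream(self, asset_id, relation_type):
        """Return every asset reachable from asset_id along edges of one relation type."""
        adjacency = defaultdict(list)
        for source, target, rel in self.edges:
            if rel == relation_type:
                adjacency[source].append(target)
        seen, stack = set(), [asset_id]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Placeholder labels only; the real asset and relation types are not given in the abstract.
dlg = DataLineageGraph()
dlg.add_asset("a1", "type_A")
dlg.add_asset("a2", "type_A")
dlg.add_asset("a3", "type_B")
dlg.add_relation("a1", "a2", "lineage")
dlg.add_relation("a2", "a3", "lineage")
dlg.add_relation("a1", "a3", "other_relation")
reachable = dlg.downstream("a1", "lineage")
```

Keeping the relation type on each edge lets a traversal follow one kind of relation at a time, which is one plausible way to ask a simple impact-analysis question such as which assets are derived from a given source.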