International Journal of Population Data Science (Sep 2024)
Developing a High Velocity Dataset Quality Checking Pipeline
Abstract
Objective
The volume and frequency of refreshed data within the [organisation] have increased significantly since the beginning of the COVID-19 pandemic. A more efficient data quality (DQ) checking process was therefore necessary.

Approach
Having previously developed an automated DQ checking tool, we focused on re-engineering the process of DQ task allocation and the communication of results.

Results
Five analysts were trained in DQ checking. A JIRA workflow tracks the management of data loading. When a dataset is ready for DQ, the Data Manager allocates a ticket to the DQ Lead, who then assigns it to one of the five analysts. Via a DQ Slack channel, the analyst is informed of and acknowledges receipt of the task. On completion of DQ, the analyst updates the ticket and transfers it to the appropriate workflow stage. Tickets that pass DQ are transferred to the Data Manager for data release, whereas those that fail are placed “On Hold”. The DQ Lead triages the issues and liaises with the relevant parties for resolution, which may require data amendments. On receipt of amended data, the DQ ticket is transferred back to the queue and the analyst is notified to re-check the data.

Conclusions
The team now complete a high volume of DQ checks efficiently. In 2023, 405 datasets, containing 1,716 tables, were quality checked, with the initial DQ taking 2.6 days on average.

Implications
The improved speed of DQ checking ensures that projects can access the latest available data whilst maintaining the expected DQ levels and the integrity and reputation of the Trusted Research Environment (TRE).
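The ticket lifecycle described in the Results section maps naturally onto a small state machine. The sketch below is purely illustrative and not part of the published workflow: the stage names, the transition table, and the `transition` helper are hypothetical stand-ins for the actual JIRA configuration.

```python
from enum import Enum, auto


class DQState(Enum):
    """Hypothetical workflow stages for a DQ ticket."""
    READY_FOR_DQ = auto()   # ticket in the queue, awaiting a checker
    IN_PROGRESS = auto()    # analyst has acknowledged and is checking
    ON_HOLD = auto()        # DQ failed; DQ Lead triaging, awaiting amendments
    PASSED = auto()         # DQ passed; with Data Manager for release
    RELEASED = auto()       # data released to projects


# Allowed moves, mirroring the lifecycle in the abstract:
# allocate -> check -> pass (release) or fail (hold) -> re-check amended data.
TRANSITIONS = {
    DQState.READY_FOR_DQ: {DQState.IN_PROGRESS},
    DQState.IN_PROGRESS: {DQState.PASSED, DQState.ON_HOLD},
    DQState.ON_HOLD: {DQState.READY_FOR_DQ},  # amended data re-enters the queue
    DQState.PASSED: {DQState.RELEASED},
}


def transition(current: DQState, target: DQState) -> DQState:
    """Move a ticket to a new stage, rejecting moves the workflow forbids."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    # Walk one failed-then-amended ticket through the cycle.
    state = DQState.READY_FOR_DQ
    for step in (DQState.IN_PROGRESS, DQState.ON_HOLD, DQState.READY_FOR_DQ):
        state = transition(state, step)
        print(state.name)
```

Encoding the allowed moves as an explicit transition table makes illegal stage changes, such as releasing data that never passed DQ, fail loudly, which is the same control a configured JIRA workflow would be expected to provide.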