International Journal of Population Data Science (Sep 2024)

Developing a generalisable stratification approach for clerical review of linked data

  • Leah Maizey,
  • Josie Platcha,
  • Tim Gammon,
  • Matt Wray,
  • Gavin Thompson,
  • Laszlo Antal,
  • Rosaland Archer

DOI
https://doi.org/10.23889/ijpds.v9i5.2651
Journal volume & issue
Vol. 9, no. 5

Abstract

Objective
Data linkage is a vital process in the creation of many national statistics, but assessing the quality of linked data is currently highly inefficient. Finding errors requires human (clerical) review, which is costly and slow, so sampling is used to reduce the clerical burden. This research aims to develop a method for stratifying links that yields representative samples while reducing the number of links reviewed. The final method will enable nuanced stratification of data for review whilst optimising resource efficiency. The objectives are to:
  • ensure that the method is adaptable across diverse datasets,
  • achieve full automation,
  • ensure scalability to accommodate large datasets.

Approach
Our approach centres on designing an algorithm that responds to variability in the distribution of probabilistic match scores and stratifies accordingly. The intention is for the method to adjust its parameters automatically, such as the strata thresholds and the number of strata, based on the data's characteristics. The research involves a comparative analysis of dynamic and percentile-based stratification against the current standard practice of static-threshold stratification.

Results
Tests are ongoing to compare the above methods on a variety of metrics, including homogeneity of strata, total variance, and between-strata distance. Findings will be presented at the conference.

Conclusions
We hope to design a robust, generalisable and scalable stratification method that can be integrated into a linkage pipeline.

Implications
Implementing the method will help to improve the quality of national statistics, ensuring that more accurate, reliable and timely outputs are produced in a resource-efficient manner.
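
As an illustration of the comparison described under Approach, the following is a minimal sketch, not the authors' algorithm. It contrasts static-threshold and percentile-based stratification of match scores and computes one possible homogeneity measure, a size-weighted within-stratum variance. All function names, the cut points and the synthetic scores are assumptions made for this example. The abstract does not specify the dynamic method, so it is not sketched here.

```python
import random
import statistics


def static_strata(scores, thresholds):
    """Assign each score to a stratum using fixed cut points (hypothetical helper)."""
    cuts = sorted(thresholds)
    # Stratum index = number of cut points the score is at or above.
    return [sum(s >= c for c in cuts) for s in scores]


def percentile_strata(scores, n_strata):
    """Assign scores to n_strata roughly equal-sized strata by rank (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * n_strata // len(scores)
    return labels


def pooled_within_variance(scores, labels):
    """Size-weighted mean of per-stratum population variances (lower = more homogeneous)."""
    groups = {}
    for s, g in zip(scores, labels):
        groups.setdefault(g, []).append(s)
    return sum(len(v) * statistics.pvariance(v) for v in groups.values()) / len(scores)


# Synthetic, purely illustrative scores: a cluster of low-scoring links and a
# larger cluster of high-scoring links, clipped to the range [0, 1].
rng = random.Random(0)
scores = [rng.gauss(0.2, 0.05) for _ in range(300)]
scores += [rng.gauss(0.9, 0.03) for _ in range(700)]
scores = [min(1.0, max(0.0, s)) for s in scores]

# Assumed cut points for the static method; four strata for the percentile method.
static_labels = static_strata(scores, thresholds=[0.25, 0.5, 0.75])
pct_labels = percentile_strata(scores, n_strata=4)

print("static     within-variance:", pooled_within_variance(scores, static_labels))
print("percentile within-variance:", pooled_within_variance(scores, pct_labels))
```

In this sketch the static cut points ignore where the scores actually cluster, while the percentile method forces equal-sized strata regardless of gaps in the distribution. A data-responsive method of the kind the abstract proposes would presumably choose both the cut points and the number of strata from the score distribution itself.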