Epidemics (Jun 2023)

Data pipelines in a public health emergency: The human in the machine

  • Katy A.M. Gaythorpe,
  • Rich G. Fitzjohn,
  • Wes Hinsley,
  • Natsuko Imai,
  • Edward S. Knock,
  • Pablo N. Perez Guzman,
  • Bimandra Djaafara,
  • Keith Fraser,
  • Marc Baguelin,
  • Neil M. Ferguson

Journal volume & issue
Vol. 43
p. 100676

Abstract

In an emergency epidemic response, data providers supply data on a best-faith effort to modellers and analysts, who are typically the end users of data collected for other primary purposes, such as to inform patient care. Thus, modellers who analyse secondary data have limited ability to influence what is captured. During an emergency response, models themselves are often under constant development and require both stability in their data inputs and flexibility to incorporate new inputs as novel data sources become available. This dynamic landscape is challenging to work with. Here we outline a data pipeline used in the ongoing COVID-19 response in the UK that aims to address these issues.

A data pipeline is a sequence of steps to carry the raw data through to a processed and useable model input, along with the appropriate metadata and context. In ours, each data type had an individual processing report, designed to produce outputs that could be easily combined and used downstream. Automated checks were built in and added to as new pathologies emerged. These cleaned outputs were collated at different geographic levels to provide standardised datasets. Finally, a human validation step was an essential component of the analysis pathway and permitted more nuanced issues to be captured. This framework allowed the pipeline to grow in complexity and volume and facilitated the diverse range of modelling approaches employed by researchers. Additionally, every report or modelling output could be traced back to the specific data version that informed it, ensuring reproducibility of results.

Our approach has been used to facilitate fast-paced analysis and has evolved over time. Our framework and its aspirations are applicable to many settings beyond COVID-19 data, for example to other outbreaks such as Ebola, or where routine and regular analyses are required.
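
The pipeline stages described in the abstract (per-data-type processing, automated checks, collation by geographic level, and tracing outputs to a data version) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record fields, the check functions, the `RAW` sample values, and the hash-based `data_version` are all assumptions made purely for illustration, and the human validation step is represented only by the list of flagged records that a person would review.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"region": "London", "date": "2020-04-01", "admissions": 120},
    {"region": "London", "date": "2020-04-02", "admissions": 131},
    {"region": "North West", "date": "2020-04-01", "admissions": 85},
    {"region": "North West", "date": "2020-04-02", "admissions": -3},
]


def data_version(records):
    """Short content hash of the raw input, used to trace outputs to it."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Each check returns None when the record passes, or a message when it fails.
def check_has_region(record):
    return None if record.get("region") else "missing region"


def check_non_negative(record):
    return None if record["admissions"] >= 0 else "negative count"


# New checks can be appended here as new data problems are discovered.
CHECKS = [check_has_region, check_non_negative]


def process(records, checks=CHECKS):
    """Split records into clean ones and ones flagged for human review."""
    clean, flagged = [], []
    for record in records:
        problems = [msg for check in checks if (msg := check(record))]
        if problems:
            flagged.append((record, problems))
        else:
            clean.append(record)
    return clean, flagged


def collate(clean, level="region"):
    """Sum admissions per region, or into a single total for any other level."""
    totals = defaultdict(int)
    for record in clean:
        key = record["region"] if level == "region" else "all"
        totals[key] += record["admissions"]
    return dict(totals)


def run(records):
    """Run the pipeline and attach the data version to the output."""
    clean, flagged = process(records)
    return {
        "data_version": data_version(records),
        "by_region": collate(clean, level="region"),
        "overall": collate(clean, level="national"),
        "needs_review": flagged,
    }


if __name__ == "__main__":
    print(json.dumps(run(RAW), indent=2))
```

Keeping the checks in a plain list is one simple way to let the set of automated checks grow over time, and attaching the version hash to every output is one possible way to make results traceable to the exact input that produced them.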

Keywords