GMS Medizinische Informatik, Biometrie und Epidemiologie (Nov 2019)

Data quality monitoring in clinical and observational epidemiologic studies: the role of metadata and process information

  • Richter, Adrian,
  • Schössow, Janka,
  • Werner, André,
  • Schauer, Birgit,
  • Radke, Dörte,
  • Henke, Jörg,
  • Struckmann, Stephan,
  • Schmidt, Carsten Oliver

DOI
https://doi.org/10.3205/mibe000202
Journal volume & issue
Vol. 15, no. 1
p. Doc08

Abstract

High data quality is fundamental for valid inferences in health research. Metadata, i.e. “data that describe other data”, are essential for implementing data quality assessments, but more guidance is needed on which metadata to use. Similarly, the selection and use of variables describing the measurement process should be illustrated with examples to improve the design and conduct of observational health studies. This work provides a conceptual framework and an overview of metadata and process information for systematic data quality reports, based on implementations within the population-based cohort Study of Health in Pomerania (SHIP). In previous years, the data dictionary was augmented to establish a prerequisite for automated data quality checks; the added information, comprising up to 20 different characteristics for each variable, is used for data quality assessments and triggers diverse data quality checks. Conceptually, we distinguish static metadata, variable metadata, and process variables. Examples of static metadata are the expected probability distribution, plausibility limits, and the data type. An example of variable metadata is the reference limits of a laboratory marker. Information inherent to these metadata allows data quality flaws to be detected by comparing observed with expected data characteristics. In contrast, process variables, such as the observer or device ID, also allow the sources of data quality issues to be identified, even if the characteristics defined in the metadata were not violated. Metadata and process variables can be used alone or in combination to implement a versatile and efficient data quality assessment. A comprehensive setup of metadata and process variables is an extensive task, particularly in studies involving large data collections. Nonetheless, the gain in transparency and efficacy of data curation and quality reporting after this setup is considerable.
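The distinction between metadata-driven checks and process variables can be illustrated with a minimal Python sketch. It is not the SHIP implementation: the data dictionary layout, the variable name `sbp`, the plausibility limits, the observer IDs, and both helper functions are hypothetical and chosen only for illustration. The first function compares observed values with static metadata (data type and plausibility limits); the second groups the remaining values by a process variable, so that systematic differences between observers become visible even when no limit is violated.

```python
from statistics import mean

# Hypothetical data dictionary entry with static metadata for one variable.
# The limits are illustrative assumptions, not values from the study.
METADATA = {
    "sbp": {"data_type": float, "plausibility_limits": (60.0, 260.0)},
}


def check_static_metadata(records, variable, metadata):
    """Return indices of records whose value violates type or plausibility limits."""
    spec = metadata[variable]
    low, high = spec["plausibility_limits"]
    flagged = []
    for i, rec in enumerate(records):
        value = rec[variable]
        if not isinstance(value, spec["data_type"]) or not (low <= value <= high):
            flagged.append(i)
    return flagged


def mean_by_process_variable(records, variable, process_variable, skip=()):
    """Return the mean of `variable` per level of a process variable."""
    groups = {}
    for i, rec in enumerate(records):
        if i in skip:
            continue  # leave out records already flagged by metadata checks
        groups.setdefault(rec[process_variable], []).append(rec[variable])
    return {level: mean(values) for level, values in groups.items()}


# Made-up example records: systolic blood pressure with the observer ID.
records = [
    {"sbp": 118.0, "observer_id": "A"},
    {"sbp": 122.0, "observer_id": "A"},
    {"sbp": 134.0, "observer_id": "B"},
    {"sbp": 138.0, "observer_id": "B"},
    {"sbp": 400.0, "observer_id": "A"},  # outside the plausibility limits
]

violations = check_static_metadata(records, "sbp", METADATA)
observer_means = mean_by_process_variable(
    records, "sbp", "observer_id", skip=violations
)
print(violations)
print(observer_means)
```

In this made-up data, the metadata check flags only the last record, while the per-observer means differ although every remaining value lies within the limits. A real assessment would apply a statistical model rather than a raw comparison of means.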

Keywords