Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test

Pradeep S. Virdee; Alice Fuller; Michael Jacobs; Tim Holt; Jacqueline Birks

doi:10.1186/s40537-020-00375-w

Journal of Big Data (Nov 2020)

Affiliations

Pradeep S. Virdee: Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford
Alice Fuller: Nuffield Department of Primary Care Health Sciences, University of Oxford
Michael Jacobs: BMS Haematology, John Radcliffe Hospital, Oxford University Hospitals
Tim Holt: Nuffield Department of Primary Care Health Sciences, University of Oxford
Jacqueline Birks: Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford

DOI: https://doi.org/10.1186/s40537-020-00375-w
Journal volume & issue: Vol. 7, no. 1
pp. 1 – 18

Abstract

A Full Blood Count (FBC) is a common blood test comprising 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the FBC data in one EHR. We accessed the Test dataset from the Clinical Practice Research Datalink (CPRD), which contains results of tests performed in primary care, such as FBC blood tests. Medical codes and entity codes, two coding systems used within CPRD to identify FBC records, were compared, and we report the level of mismatched coding and the number of mismatches that could be rectified. The reliability of units of measurement is also described and missing data are discussed. There were 14 entity codes and 138 medical codes for the FBC in the data. Medical and entity codes consistently corresponded to the same FBC parameter in 95.2% (n = 217,752,448) of parameters. Among the 4.8% (n = 10,955,006) of mismatches, the most commonly rectified parameter was mean platelet volume (n = 2,041,360); 1,191,540 mismatches could not be rectified and were removed. Units of measurement were often missing, partially entered, or did not appear to correspond to the blood value. The final dataset contained 16,537,017 FBC tests. Applying mathematical equations to derive some missing parameters in these FBCs resulted in 15 of 20 parameters available per FBC on average, with 0.3% of FBCs having all 20 parameters. Performing data quality checks can help to understand the extent of any issues in the dataset. We emphasise balancing large sample sizes with reliability of the data.
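The abstract does not state which equations were used to derive missing parameters. As a minimal sketch only, the example below assumes the standard red cell index relations (MCV = haematocrit × 1000 / RBC, MCH = haemoglobin / RBC, MCHC = haemoglobin / haematocrit) and fills a missing value only when both inputs are present. The function name, field names, and units are hypothetical and are not taken from CPRD or from the paper.

```python
from typing import Dict, Optional

FbcRecord = Dict[str, Optional[float]]


def derive_red_cell_indices(fbc: FbcRecord) -> FbcRecord:
    """Return a copy of `fbc` with missing MCV, MCH and MCHC filled in
    where the required inputs are available.

    Assumed (hypothetical) field names and units:
      haemoglobin_g_per_l   haemoglobin, g/L
      rbc_10e12_per_l       red blood cell count, 10^12/L
      haematocrit_l_per_l   haematocrit, L/L (a fraction, e.g. 0.45)
    """
    out = dict(fbc)
    hb = out.get("haemoglobin_g_per_l")
    rbc = out.get("rbc_10e12_per_l")
    hct = out.get("haematocrit_l_per_l")

    # Mean cell volume (fL): haematocrit scaled to mL/L, divided by RBC count.
    # The truthiness check on rbc skips both missing and zero values.
    if out.get("mcv_fl") is None and hct is not None and rbc:
        out["mcv_fl"] = hct * 1000.0 / rbc

    # Mean cell haemoglobin (pg): haemoglobin per red cell.
    if out.get("mch_pg") is None and hb is not None and rbc:
        out["mch_pg"] = hb / rbc

    # Mean cell haemoglobin concentration (g/L): haemoglobin per unit
    # volume of packed red cells.
    if out.get("mchc_g_per_l") is None and hb is not None and hct:
        out["mchc_g_per_l"] = hb / hct

    return out


# Illustrative values only, not data from the study.
record: FbcRecord = {
    "haemoglobin_g_per_l": 150.0,
    "rbc_10e12_per_l": 5.0,
    "haematocrit_l_per_l": 0.45,
    "mcv_fl": None,
}
filled = derive_red_cell_indices(record)
```

Recorded values are never overwritten, so a derived figure only fills a gap. As the abstract notes that units were often missing or inconsistent, any such derivation would in practice depend on the inputs first being confirmed to share the assumed units.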

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
