BMJ Open (Sep 2023)

Identifying potential biases in code sequences in primary care electronic healthcare records: a retrospective cohort study of the determinants of code frequency

  • Azeem Majeed
  • Thomas Woodcock
  • Jonathan Clarke
  • Paul Aylin
  • Thomas Beaney
  • David Salman
  • Mauricio Barahona

DOI
https://doi.org/10.1136/bmjopen-2023-072884
Journal volume & issue
Vol. 13, no. 9

Abstract

Objectives To determine whether the frequency of diagnostic codes for long-term conditions (LTCs) in primary care electronic healthcare records (EHRs) is associated with (1) disease coding incentives, (2) General Practice (GP), (3) patient sociodemographic characteristics and (4) calendar year of diagnosis.

Design Retrospective cohort study.

Setting GPs in England from 2015 to 2022 contributing to the Clinical Practice Research Datalink Aurum dataset.

Participants All patients registered to a GP with at least one incident LTC diagnosed between 1 January 2015 and 31 December 2019.

Primary and secondary outcome measures The number of diagnostic codes for an LTC in (1) the first and (2) the second year following diagnosis, stratified by inclusion in the Quality and Outcomes Framework (QOF) financial incentive programme.

Results 3 113 724 patients were included, with 7 723 365 incident LTCs. Conditions included in QOF had higher rates of annual coding than conditions not included in QOF (1.03 vs 0.32 per year, p<0.0001). There was significant variation in code frequency by GP which was not explained by patient sociodemographics. We found significant associations with patient sociodemographics, with a trend towards higher coding rates in people living in areas of higher deprivation for both QOF and non-QOF conditions. Code frequency was lower for conditions with follow-up time in 2020, associated with the onset of the COVID-19 pandemic.

Conclusions The frequency of diagnostic codes for newly diagnosed LTCs is influenced by factors including patient sociodemographics, disease inclusion in QOF, GP practice and the impact of the COVID-19 pandemic. Natural language processing or other methods using temporally ordered code sequences should account for these factors to minimise potential bias.
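To make the outcome measure concrete, the following is a minimal Python sketch of how a code count per follow-up year, stratified by QOF inclusion, might be computed. It is not the authors' implementation. The record layout, the function names (`add_years`, `codes_per_follow_up_year`, `mean_rate_by_group`), the choice to exclude the code recorded on the diagnosis date, and the toy records are all assumptions made for illustration; the toy values are invented and do not reproduce the reported rates.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def codes_per_follow_up_year(diagnosed, code_dates):
    """Count diagnostic codes in the first and second year after diagnosis.

    Assumption (not from the abstract): the code on the diagnosis date is
    excluded, and each follow-up year is the interval (start, end].
    """
    end_y1 = add_years(diagnosed, 1)
    end_y2 = add_years(diagnosed, 2)
    year1 = sum(1 for d in code_dates if diagnosed < d <= end_y1)
    year2 = sum(1 for d in code_dates if end_y1 < d <= end_y2)
    return year1, year2


def mean_rate_by_group(records):
    """Mean codes per follow-up year, split by QOF inclusion."""
    totals = {}
    for rec in records:
        y1, y2 = codes_per_follow_up_year(rec["diagnosed"], rec["codes"])
        group = "QOF" if rec["in_qof"] else "non-QOF"
        count, years = totals.get(group, (0, 0))
        # Each record contributes two follow-up years to its group.
        totals[group] = (count + y1 + y2, years + 2)
    return {g: count / years for g, (count, years) in totals.items()}


# Hypothetical toy records: one incident LTC per entry.
records = [
    {
        "diagnosed": date(2016, 3, 1),
        "in_qof": True,
        "codes": [date(2016, 3, 1), date(2016, 9, 10),
                  date(2017, 2, 28), date(2017, 8, 15)],
    },
    {
        "diagnosed": date(2017, 6, 15),
        "in_qof": False,
        "codes": [date(2017, 6, 15), date(2018, 1, 20)],
    },
]

print(mean_rate_by_group(records))  # {'QOF': 1.5, 'non-QOF': 0.5}
```

A real analysis would also need to handle incomplete follow-up (for example, patients who deregister before two years have elapsed), which this sketch ignores.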