PLoS ONE (Jan 2024)

Data cleaning and harmonization of clinical trial data: Medication-assisted treatment for opioid use disorder.

  • Raymond R Balise,
  • Mei-Chen Hu,
  • Anna R Calderon,
  • Gabriel J Odom,
  • Laura Brandt,
  • Sean X Luo,
  • Daniel J Feaster

DOI
https://doi.org/10.1371/journal.pone.0312695
Journal volume & issue
Vol. 19, no. 11
p. e0312695

Abstract

Several large-scale, pragmatic clinical trials on opioid use disorder (OUD) have been completed in the National Drug Abuse Treatment Clinical Trials Network (CTN). However, the resulting data have not been harmonized across studies to allow comparison of patient characteristics. This paper provides lessons learned from a large-scale harmonization process that are critical for all biomedical researchers collecting new data and for those tasked with combining datasets. We harmonized data from multiple domains from three trials: CTN-0027 (N = 1269), which compared methadone and buprenorphine at federally licensed methadone treatment programs; CTN-0030 (N = 653), which recruited patients who used predominantly prescription opioids and were treated with buprenorphine; and CTN-0051 (N = 570), which compared buprenorphine and extended-release naltrexone (XR-NTX) and recruited from inpatient treatment facilities. We harmonized the patient-level data into 23 meticulously documented database tables covering more than 110 variables, and created three additional tables of metadata describing the study designs and treatment arms. Domains included social and demographic characteristics, medical and psychiatric history, self-reported drug use details and urine drug screening results, withdrawal, and treatment drug details. Here, we summarize the numerous issues with the organization and fidelity of the publicly available data that we noted and resolved, and we present patient characteristics across the three trials and the harmonized domains. A systematic harmonization of OUD clinical trial data can be accomplished, despite heterogeneous data coding and classification procedures, by standardizing commonly assessed characteristics. Similar methods, embracing database normalization and/or "tidy" data, should be used for future datasets in other substance use disorder clinical trials.
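
To make the idea of standardizing a commonly assessed characteristic concrete, the following is a minimal sketch of one way heterogeneous codes could be mapped into a single "tidy" table. It is not the authors' pipeline. The trial labels (`trial_a`, `trial_b`, `trial_c`), the column names, the raw codes, and the `harmonize` helper are all hypothetical and are not taken from the CTN datasets.

```python
# Hypothetical raw rows: each trial stores the same characteristic
# under a different column name and with a different coding scheme.
RAW = {
    "trial_a": [{"id": "a1", "gender": 1}, {"id": "a2", "gender": 2}],
    "trial_b": [{"id": "b1", "sex": "M"}, {"id": "b2", "sex": "F"}],
    "trial_c": [{"id": "c1", "SEX": "male"}, {"id": "c2", "SEX": "female"}],
}

# Hypothetical per-trial codebook: source column plus a mapping
# from each raw code to the shared, harmonized label.
CODEBOOK = {
    "trial_a": ("gender", {1: "male", 2: "female"}),
    "trial_b": ("sex", {"M": "male", "F": "female"}),
    "trial_c": ("SEX", {"male": "male", "female": "female"}),
}


def harmonize(raw, codebook):
    """Return one row per participant with the same columns for every trial."""
    tidy = []
    for trial, rows in raw.items():
        column, mapping = codebook[trial]
        for row in rows:
            tidy.append({
                "trial": trial,
                "participant": row["id"],
                # An unmapped or missing code becomes None so it can be reviewed.
                "sex": mapping.get(row.get(column)),
            })
    return tidy


rows = harmonize(RAW, CODEBOOK)
for r in rows:
    print(r["trial"], r["participant"], r["sex"])
```

Keeping the codebook separate from the transformation means each recoding decision is documented in one place, and any raw code missing from the codebook surfaces as `None` rather than being silently carried forward.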