BMC Cancer (Aug 2019)

Machine learning enables detection of early-stage colorectal cancer by whole-genome sequencing of plasma cell-free DNA

  • Nathan Wan,
  • David Weinberg,
  • Tzu-Yu Liu,
  • Katherine Niehaus,
  • Eric A. Ariazi,
  • Daniel Delubac,
  • Ajay Kannan,
  • Brandon White,
  • Mitch Bailey,
  • Marvin Bertin,
  • Nathan Boley,
  • Derek Bowen,
  • James Cregg,
  • Adam M. Drake,
  • Riley Ennis,
  • Signe Fransen,
  • Erik Gafni,
  • Loren Hansen,
  • Yaping Liu,
  • Gabriel L. Otte,
  • Jennifer Pecson,
  • Brandon Rice,
  • Gabriel E. Sanderson,
  • Aarushi Sharma,
  • John St. John,
  • Catherina Tang,
  • Abraham Tzou,
  • Leilani Young,
  • Girish Putcha,
  • Imran S. Haque

DOI
https://doi.org/10.1186/s12885-019-6003-8
Journal volume & issue
Vol. 19, no. 1
pp. 1 – 10

Abstract

Background: Blood-based methods using cell-free DNA (cfDNA) are under development as an alternative to existing screening tests. However, early-stage detection of cancer using tumor-derived cfDNA has proven challenging because of the small proportion of cfDNA derived from tumor tissue in early-stage disease. A machine learning approach to discover signatures in cfDNA, potentially reflective of both tumor and non-tumor contributions, may represent a promising direction for the early detection of cancer.

Methods: Whole-genome sequencing was performed on cfDNA extracted from plasma samples (N = 546 colorectal cancer and 271 non-cancer controls). Reads aligning to protein-coding gene bodies were extracted, and read counts were normalized. cfDNA tumor fraction was estimated using ichorCNA. Machine learning models were trained using k-fold cross-validation and confounder-based cross-validations to assess generalization performance.

Results: In a colorectal cancer cohort heavily weighted towards early-stage cancer (80% stage I/II), we achieved a mean AUC of 0.92 (95% CI 0.91–0.93) with a mean sensitivity of 85% (95% CI 83–86%) at 85% specificity. Sensitivity generally increased with tumor stage and increasing tumor fraction. Stratification by age, sequencing batch, and institution demonstrated the impact of these confounders and provided a more accurate assessment of generalization performance.

Conclusions: A machine learning approach using cfDNA achieved high sensitivity and specificity in a large, predominantly early-stage, colorectal cancer cohort. The possibility of systematic technical and institution-specific biases warrants similar confounder analyses in other studies. Prospective validation of this machine learning method and evaluation of a multi-analyte approach are underway.
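Two ideas in the abstract can be illustrated with a minimal Python sketch: reporting sensitivity at a fixed specificity, and building cross-validation folds that keep a potential confounder (for example institution or sequencing batch) entirely inside one fold. This is not the study's pipeline, which the abstract does not detail; the function names `sensitivity_at_specificity` and `grouped_folds`, the thresholding rule, and the fold-assignment heuristic are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def sensitivity_at_specificity(control_scores, case_scores, specificity=0.85):
    """Return (sensitivity, threshold) with the threshold set on controls.

    Assumption: a sample is called positive when its score is strictly
    greater than the threshold, and the threshold is the smallest control
    score such that at least `specificity` of controls are at or below it.
    """
    ranked = sorted(control_scores)
    n_below = math.ceil(specificity * len(ranked))
    threshold = ranked[max(n_below, 1) - 1]
    called = sum(1 for s in case_scores if s > threshold)
    return called / len(case_scores), threshold


def grouped_folds(groups, k):
    """Yield (train_idx, test_idx) pairs with each group confined to one fold.

    `groups` holds one hypothetical confounder label per sample. Keeping a
    label out of the training set whenever it is in the test set means the
    model cannot score well merely by recognising that label.
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    # Greedy balancing: largest groups first, each into the smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        min(folds, key=len).extend(members[g])
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test
```

With scores from a fitted classifier, one would call `sensitivity_at_specificity` on each held-out fold produced by `grouped_folds` and compare the result with ordinary k-fold splits; a large gap between the two would suggest that the confounder contributes to apparent performance.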

Keywords