Communications Biology (Mar 2024)

Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data

  • Nikita Kotlov,
  • Kirill Shaposhnikov,
  • Cagdas Tazearslan,
  • Madison Chasse,
  • Artur Baisangurov,
  • Svetlana Podsvirova,
  • Dawn Fernandez,
  • Mary Abdou,
  • Leznath Kaneunyenye,
  • Kelley Morgan,
  • Ilya Cheremushkin,
  • Pavel Zemskiy,
  • Maxim Chelushkin,
  • Maria Sorokina,
  • Ekaterina Belova,
  • Svetlana Khorkova,
  • Yaroslav Lozinsky,
  • Katerina Nuzhdina,
  • Elena Vasileva,
  • Dmitry Kravchenko,
  • Kushal Suryamohan,
  • Krystle Nomie,
  • John Curran,
  • Nathan Fowler,
  • Alexander Bagaev

DOI
https://doi.org/10.1038/s42003-024-06020-z
Journal volume & issue
Vol. 7, no. 1
pp. 1 – 14

Abstract

With the increased use of gene expression profiling for personalized oncology, optimized RNA sequencing (RNA-seq) protocols and algorithms are necessary to provide comparable expression measurements between exome capture (EC)-based and poly-A RNA-seq. Here, we developed and optimized an EC-based protocol for processing formalin-fixed, paraffin-embedded samples and a machine-learning algorithm, Procrustes, to overcome batch effects across RNA-seq data obtained using different sample preparation protocols, such as EC-based and poly-A RNA-seq. After applying Procrustes to samples processed using both EC and poly-A RNA-seq protocols, the expression of 61% of genes (N = 20,062) correlated across the two protocols (concordance correlation coefficient > 0.8, versus 26% before transformation by Procrustes), including 84% of cancer-specific and cancer microenvironment-related genes (N = 1,438; versus 36% before applying Procrustes). In benchmarking analyses, Procrustes also outperformed other batch correction methods. Finally, we showed that Procrustes can project RNA-seq data for a single sample onto a larger cohort of RNA-seq data. Future application of Procrustes will enable direct gene expression analysis for single tumor samples to support gene expression-based treatment decisions.
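
The agreement criterion quoted in the abstract (concordance correlation coefficient > 0.8) can be illustrated with a minimal sketch. It assumes the coefficient is Lin's concordance correlation coefficient and that each gene is compared across the same samples profiled with both protocols. The function names `concordance_correlation` and `fraction_concordant`, the samples-by-genes array layout, and the toy data are hypothetical choices for illustration and are not taken from the Procrustes implementation.

```python
import numpy as np


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two 1-D arrays.

    Uses population (biased) variance and covariance. Returns NaN when the
    denominator is zero (both inputs constant with equal means).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))
    denom = x.var() + y.var() + (mean_x - mean_y) ** 2
    if denom == 0:
        return float("nan")
    return float(2.0 * cov_xy / denom)


def fraction_concordant(expr_a, expr_b, threshold=0.8):
    """Fraction of genes whose per-gene CCC exceeds `threshold`.

    expr_a, expr_b: arrays of shape (n_samples, n_genes) holding expression
    of the same samples measured with two protocols (hypothetical layout).
    """
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    ccc = np.array([
        concordance_correlation(expr_a[:, g], expr_b[:, g])
        for g in range(expr_a.shape[1])
    ])
    # NaN > threshold evaluates to False, so undefined genes count as discordant.
    return float(np.mean(ccc > threshold)), ccc


# Toy data: gene 0 agrees exactly; gene 1 is offset by a constant shift.
protocol_a = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
protocol_b = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
fraction, per_gene_ccc = fraction_concordant(protocol_a, protocol_b)
```

Unlike the Pearson correlation, this coefficient penalizes systematic shifts between protocols: the second toy gene is perfectly linearly related across the two arrays yet falls below the 0.8 threshold because of its constant offset.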