Leveraging gene correlations in single cell transcriptomic data

Kai Silkwood; Emmanuel Dollinger; Joshua Gervin; Scott Atwood; Qing Nie; Arthur D. Lander

doi:10.1186/s12859-024-05926-z

BMC Bioinformatics (Sep 2024)

Affiliations

Kai Silkwood: Center for Complex Biological Systems, University of California, Irvine
Emmanuel Dollinger: Center for Complex Biological Systems, University of California, Irvine
Joshua Gervin: Center for Complex Biological Systems, University of California, Irvine
Scott Atwood: Center for Complex Biological Systems, University of California, Irvine
Qing Nie: Center for Complex Biological Systems, University of California, Irvine
Arthur D. Lander: Center for Complex Biological Systems, University of California, Irvine

Journal volume & issue: Vol. 25, no. 1, pp. 1–43

Abstract

Background: Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data (looking for rare cell types, subtleties of cell states, and details of gene regulatory networks), there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually).

Results: We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization, a step that skews distributions, particularly for sparse data, and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene–gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships.

Conclusions: New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene–gene correlations.
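The idea of testing unnormalized counts against a sampling-noise null can be illustrated with a small simulation. The sketch below is not the BigSur implementation, and the abstract does not specify its statistics. It assumes only that a gene with no cell-to-cell heterogeneity yields Poisson counts whose mean is proportional to each cell's sequencing depth. All names and parameter values (`depth`, `pearson_residual_fano`, the expression rates) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 2000

# Hypothetical relative sequencing depth per cell (assumption: lognormal spread).
depth = rng.lognormal(mean=0.0, sigma=0.5, size=n_cells)
depth /= depth.mean()


def pearson_residual_fano(counts, depth):
    """Sum of squared Pearson residuals / (n - 1) under a Poisson null
    whose per-cell mean is proportional to sequencing depth."""
    mu = counts.sum() * depth / depth.sum()  # expected raw count per cell
    resid = (counts - mu) / np.sqrt(mu)      # Poisson null: variance equals mean
    return np.sum(resid ** 2) / (len(counts) - 1)


# Gene A: identical expression rate in every cell, so only sampling noise.
counts_a = rng.poisson(0.3 * depth)

# Gene B: same average rate, but expressed in only about 10% of cells.
rate_b = np.where(rng.random(n_cells) < 0.1, 3.0, 0.0)
counts_b = rng.poisson(rate_b * depth)

fano_a = pearson_residual_fano(counts_a, depth)
fano_b = pearson_residual_fano(counts_b, depth)
print(f"gene A (null): {fano_a:.2f}   gene B (heterogeneous): {fano_b:.2f}")
```

Under these assumptions the statistic should sit near 1 for gene A and well above 1 for gene B, even though both genes are sparse and have the same mean raw count. Working on raw counts with an explicit expected value per cell avoids rescaling the data before the comparison is made.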

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
