Cell Reports: Methods (Jul 2024)

Directly selecting cell-type marker genes for single-cell clustering analyses

  • Zihao Chen,
  • Changhu Wang,
  • Siyuan Huang,
  • Yang Shi,
  • Ruibin Xi

Journal volume & issue
Vol. 4, no. 7
p. 100810

Abstract

Summary: In single-cell RNA sequencing (scRNA-seq) studies, cell types and their marker genes are often identified by clustering and differentially expressed gene (DEG) analysis. A common practice is to select genes using surrogate criteria such as variance and deviance, cluster the cells using the selected genes, and then detect markers by DEG analysis that treats the inferred clusters as known cell types. The surrogate criteria can miss important genes or select unimportant ones, while the DEG analysis suffers from selection bias, because the same data are used both to define the clusters and to test for differences between them. We present Festem, a statistical method for the direct selection of cell-type markers for downstream clustering. Festem identifies cluster-informative marker genes, that is, genes whose expression is heterogeneously distributed across cells. Simulations and scRNA-seq applications demonstrate that Festem selects markers sensitively and with high precision, and that it enables the identification of cell types often missed by other methods. In a large intrahepatic cholangiocarcinoma dataset, we identify diverse CD8+ T cell types and potential prognostic marker genes.

Motivation: A fundamental problem in single-cell RNA sequencing (scRNA-seq) studies is identifying cell types and their associated marker genes using clustering and differential expression analysis between clusters. Many sequenced genes are irrelevant to cell type, yet they can significantly influence cell-type identification. Ideally, one would select marker genes so as to achieve the best cell-type identification. However, because cell types are unknown, directly selecting marker genes seems impractical, so surrogate criteria such as variance, deviance, and zero proportions are used for gene selection instead. These surrogate criteria can miss important genes or select unimportant ones, leaving potentially relevant cell types unidentified. We aim to develop a method that directly selects marker genes with high accuracy and significantly improves cell-type identification.

Keywords