Nature Communications (May 2025)

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

  • Yuansong Zeng,
  • Jiancong Xie,
  • Ningyuan Shangguan,
  • Zhuoyi Wei,
  • Wenbing Li,
  • Yun Su,
  • Shuangyu Yang,
  • Chengyang Zhang,
  • Jinbo Zhang,
  • Nan Fang,
  • Hongyu Zhang,
  • Yutong Lu,
  • Huiying Zhao,
  • Jue Fan,
  • Weijiang Yu,
  • Yuedong Yang

DOI
https://doi.org/10.1038/s41467-025-59926-5
Journal volume & issue
Vol. 16, no. 1
pp. 1 – 17

Abstract

Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent noise, batch effects, and sparsity, highlighting the need for a unified model to represent cellular states. To address this problem, many recent efforts have focused on training single-cell foundation models on large datasets. However, current human foundation models remain limited by the sizes of their training data and model parameters. Here, we have collected a diverse dataset of 100 million human cells, on which we train a single-cell foundation model (CellFM) containing 800 million parameters. To balance efficiency and performance, the model is trained through a modified RetNet framework on the MindSpore platform. Extensive experiments show that CellFM outperforms existing models in cell annotation, perturbation prediction, gene function prediction, and gene-gene relationship capturing.
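
The abstract states that CellFM is built on a modified RetNet backbone. As a rough illustration of the retention mechanism that RetNet-style encoders substitute for softmax attention, the following minimal NumPy sketch implements the single-head parallel form of retention; the dimensions, decay rate, and single-head setup are illustrative assumptions and not CellFM's actual implementation.

    # Minimal NumPy sketch of the "retention" operation used by RetNet-style
    # encoders (parallel form). Sizes and the decay value gamma are
    # hypothetical; this is not CellFM's code.
    import numpy as np

    def retention(X, W_q, W_k, W_v, gamma=0.95):
        """Single-head retention over a sequence X of shape (seq_len, d_model)."""
        Q = X @ W_q                      # queries, (seq_len, d_head)
        K = X @ W_k                      # keys,    (seq_len, d_head)
        V = X @ W_v                      # values,  (seq_len, d_head)
        n = X.shape[0]
        # Causal decay matrix: D[i, j] = gamma**(i - j) for i >= j, else 0.
        idx = np.arange(n)
        D = np.where(idx[:, None] >= idx[None, :],
                     gamma ** (idx[:, None] - idx[None, :]), 0.0)
        # Retention replaces softmax attention with decayed linear mixing.
        return (Q @ K.T * D) @ V

    # Toy usage with hypothetical sizes (e.g. a short gene-token sequence).
    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 8, 16, 16
    X = rng.standard_normal((seq_len, d_model))
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    out = retention(X, W_q, W_k, W_v)    # (seq_len, d_head)

The decayed linear mixing is what lets retention-based models trade the quadratic softmax attention for forms that support efficient parallel training and recurrent inference, which is the efficiency/performance balance the abstract alludes to.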