Nature Communications (May 2025)

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

  • Yuansong Zeng,
  • Jiancong Xie,
  • Ningyuan Shangguan,
  • Zhuoyi Wei,
  • Wenbing Li,
  • Yun Su,
  • Shuangyu Yang,
  • Chengyang Zhang,
  • Jinbo Zhang,
  • Nan Fang,
  • Hongyu Zhang,
  • Yutong Lu,
  • Huiying Zhao,
  • Jue Fan,
  • Weijiang Yu,
  • Yuedong Yang

DOI
https://doi.org/10.1038/s41467-025-59926-5
Journal volume & issue
Vol. 16, no. 1
pp. 1 – 17

Abstract

Single-cell sequencing provides transcriptomic profiling at single-cell resolution, uncovering cellular heterogeneity with unprecedented precision. Yet current single-cell data analysis suffers from inherent noise, batch effects, and sparsity, highlighting the need for a unified model to represent cellular states. To address this problem, many recent efforts have focused on training single-cell foundation models on large datasets. However, current human foundation models remain limited by the sizes of their training data and model parameters. Here, we have collected a diverse dataset of 100 million human cells, on which we train a single-cell foundation model (CellFM) containing 800 million parameters. To balance efficiency and performance, the model is trained through a modified RetNet framework on the MindSpore platform. Extensive experiments show that CellFM outperforms existing models in cell annotation, perturbation prediction, gene function prediction, and gene-gene relationship capturing.
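
The abstract states that CellFM is built on a modified RetNet backbone. As a rough illustration of the retention mechanism that RetNet-style encoders substitute for softmax attention, the following minimal NumPy sketch implements the single-head parallel form of retention; the dimensions, decay rate, and single-head setup are illustrative assumptions and not CellFM's actual implementation.

    # Minimal NumPy sketch of the "retention" operation used by RetNet-style
    # encoders (parallel form). Sizes and the decay value gamma are
    # hypothetical; this is not CellFM's code.
    import numpy as np

    def retention(X, W_q, W_k, W_v, gamma=0.95):
        """Single-head retention over a sequence X of shape (seq_len, d_model)."""
        Q = X @ W_q                      # queries, (seq_len, d_head)
        K = X @ W_k                      # keys,    (seq_len, d_head)
        V = X @ W_v                      # values,  (seq_len, d_head)
        n = X.shape[0]
        # Causal decay matrix: D[i, j] = gamma**(i - j) for i >= j, else 0.
        idx = np.arange(n)
        D = np.where(idx[:, None] >= idx[None, :],
                     gamma ** (idx[:, None] - idx[None, :]), 0.0)
        # Retention replaces softmax attention with decayed linear mixing.
        return (Q @ K.T * D) @ V

    # Toy usage with hypothetical sizes (e.g. a short gene-token sequence).
    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 8, 16, 16
    X = rng.standard_normal((seq_len, d_model))
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    out = retention(X, W_q, W_k, W_v)    # (seq_len, d_head)

The decayed linear mixing is what lets retention-based models trade the quadratic softmax attention for forms that support efficient parallel training and recurrent inference, which is the efficiency/performance balance the abstract alludes to.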