scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics
Yuchen Wang,
Xingjian Chen,
Zetian Zheng,
Lei Huang,
Weidun Xie,
Fuzhou Wang,
Zhaolei Zhang,
Ka-Chun Wong
Affiliations
Yuchen Wang
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
Xingjian Chen
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR; Cutaneous Biology Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Zetian Zheng
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
Lei Huang
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
Weidun Xie
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
Fuzhou Wang
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
Zhaolei Zhang
Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada; Department of Computer Science, University of Toronto, Toronto, ON, Canada
Ka-Chun Wong
Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR; Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China; Corresponding author
Summary: Gene regulatory networks (GRNs) involve complex, multi-layer regulatory interactions between regulators and their target genes. Precise knowledge of GRNs is important for understanding cellular processes and molecular functions. Recent breakthroughs in single-cell sequencing technology have made it possible to infer GRNs at the single-cell level. Existing methods, however, are limited by expensive computation and, in some cases, by simplistic assumptions. To overcome these obstacles, we propose scGREAT, a framework that infers GRNs from single-cell transcriptomics using gene embeddings and a transformer. scGREAT starts by constructing gene expression and gene biotext dictionaries from scRNA-seq data and gene text information. The representation of each pair of transcription factor (TF) and gene is then learned by optimizing the embedding space with a transformer-based engine. Results show that scGREAT outperformed other contemporary methods on benchmarks. In addition, the gene representations from scGREAT provide valuable insights into gene regulation, and external validation on spatial transcriptomics illuminated the mechanism behind scGREAT annotation. Moreover, scGREAT identified several TF-target regulations that are corroborated by previous studies.
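The pipeline outlined in the summary (two per-gene dictionaries, then a transformer-style model over each TF-gene pair) can be illustrated with a minimal NumPy sketch. This is not the scGREAT implementation: the function names (`build_expression_dictionary`, `self_attention`, `score_pair`), the four-token layout, the single attention head, the dimensions, and the random, untrained weights are all assumptions chosen for illustration, and the text embeddings are random placeholders standing in for embeddings of gene descriptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def build_expression_dictionary(expr, genes):
    """Map each gene name to its z-scored expression profile across cells.

    expr: array of shape (n_genes, n_cells), e.g. from an scRNA-seq matrix.
    """
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True) + 1e-8
    z = (expr - mean) / std
    return {g: z[i] for i, g in enumerate(genes)}


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def score_pair(tf, target, expr_dict, text_dict, params):
    """Return a probability-like score that `tf` regulates `target`.

    Assumed layout: four tokens per pair (TF expression, target expression,
    TF text, target text), each projected to the same model dimension.
    """
    tokens = np.stack([
        expr_dict[tf] @ params["w_expr"],
        expr_dict[target] @ params["w_expr"],
        text_dict[tf] @ params["w_text"],
        text_dict[target] @ params["w_text"],
    ])
    attended = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    pooled = attended.mean(axis=0)           # mean-pool the four tokens
    logit = pooled @ params["w_out"]         # linear read-out to one logit
    return float(1.0 / (1.0 + np.exp(-logit)))


# Toy inputs: hypothetical gene names and synthetic counts.
genes = ["TF_A", "GENE_B", "GENE_C"]
n_cells, text_dim, d_model = 50, 16, 8
expr = rng.poisson(2.0, size=(len(genes), n_cells)).astype(float)

expr_dict = build_expression_dictionary(expr, genes)
# Placeholder "biotext" dictionary: random vectors instead of real text embeddings.
text_dict = {g: rng.normal(size=text_dim) for g in genes}

# Random, untrained parameters; in practice these would be learned from labeled pairs.
params = {
    "w_expr": 0.1 * rng.normal(size=(n_cells, d_model)),
    "w_text": 0.1 * rng.normal(size=(text_dim, d_model)),
    "w_q": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_k": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_v": 0.1 * rng.normal(size=(d_model, d_model)),
    "w_out": 0.1 * rng.normal(size=d_model),
}

prob = score_pair("TF_A", "GENE_B", expr_dict, text_dict, params)
print(f"score(TF_A -> GENE_B) = {prob:.3f}")
```

Because the weights are untrained, the printed score carries no biological meaning; the sketch only shows how expression-derived and text-derived gene representations might be combined into one pair-level input for an attention layer.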