Deep sampling of gRNA in the human genome and deep-learning-informed prediction of gRNA activities

Heng Zhang; Jianfeng Yan; Zhike Lu; Yangfan Zhou; Qingfeng Zhang; Tingting Cui; Yini Li; Hui Chen; Lijia Ma

doi:10.1038/s41421-023-00549-9

Cell Discovery (May 2023)

Affiliations

Heng Zhang: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine
Jianfeng Yan: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine
Zhike Lu: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine
Yangfan Zhou: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine
Qingfeng Zhang: AIdit Therapeutics
Tingting Cui: AIdit Therapeutics
Yini Li: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine
Hui Chen: AIdit Therapeutics
Lijia Ma: Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine

DOI: https://doi.org/10.1038/s41421-023-00549-9
Journal volume & issue: Vol. 9, no. 1
pp. 1 – 20

Abstract

Life science studies involving clustered regularly interspaced short palindromic repeat (CRISPR) editing generally apply the best-performing guide RNA (gRNA) for a gene of interest. Computational models are combined with massive experimental quantification on synthetic gRNA-target libraries to accurately predict gRNA activity and mutational patterns. However, the measurements are inconsistent between studies due to differences in the designs of the gRNA-target pair constructs, and there has not yet been an integrated investigation that concurrently focuses on multiple facets of gRNA capacity. In this study, we analyzed the DNA double-strand break (DSB)-induced repair outcomes and measured SpCas9/gRNA activities at both matched and mismatched locations using 926,476 gRNAs covering 19,111 protein-coding genes and 20,268 non-coding genes. We developed machine learning models to forecast the on-target cleavage efficiency (AIdit_ON), off-target cleavage specificity (AIdit_OFF), and mutational profiles (AIdit_DSB) of SpCas9/gRNA from a uniformly collected and processed dataset by deep sampling and massively quantifying gRNA capabilities in K562 cells. Each of these models exhibited superior performance in predicting SpCas9/gRNA activities on independent datasets when benchmarked against previous models. We also empirically determined a previously unknown parameter: the “sweet spot” in the size of datasets needed to establish an effective model for predicting gRNA capabilities at a manageable experimental scale. In addition, we observed cell type-specific mutational profiles and were able to identify nucleotidylexotransferase as the key factor driving these outcomes. These massive datasets and deep learning algorithms have been implemented in the user-friendly web service http://crispr-aidit.com to evaluate and rank gRNAs for life science studies.

Published in Cell Discovery

ISSN: 2056-5968 (Online)
Publisher: Nature Publishing Group
Country of publisher: United Kingdom
LCC subjects: Science: Biology (General): Cytology
Website: http://www.nature.com/celldisc/
