Genomics, Proteomics & Bioinformatics (Jun 2023)

Performance Comparison of Computational Methods for the Prediction of the Function and Pathogenicity of Non-coding Variants

  • Zheng Wang,
  • Guihu Zhao,
  • Bin Li,
  • Zhenghuan Fang,
  • Qian Chen,
  • Xiaomeng Wang,
  • Tengfei Luo,
  • Yijing Wang,
  • Qiao Zhou,
  • Kuokuo Li,
  • Lu Xia,
  • Yi Zhang,
  • Xun Zhou,
  • Hongxu Pan,
  • Yuwen Zhao,
  • Yige Wang,
  • Lin Wang,
  • Jifeng Guo,
  • Beisha Tang,
  • Kun Xia,
  • Jinchen Li

Journal volume & issue
Vol. 21, no. 3
pp. 649 – 661

Abstract


Non-coding variants in the human genome significantly influence human traits and complex diseases through their regulatory and modifying effects. Accordingly, a growing number of computational methods have been developed to predict the effects of variants in human non-coding sequences. However, it is difficult for inexperienced users to select an appropriate method from the dozens available. To address this issue, we assessed 12 performance metrics of 24 methods on four independent non-coding variant benchmark datasets: (1) rare germline variants from clinically relevant sequence variants (ClinVar), (2) rare somatic variants from the Catalogue Of Somatic Mutations In Cancer (COSMIC), (3) common regulatory variants from curated expression quantitative trait locus (eQTL) data, and (4) disease-associated common variants from curated genome-wide association studies (GWAS). The 24 tested methods performed differently across these datasets, indicating that their strengths and weaknesses depend on the scenario. Importantly, the performance of existing methods was acceptable for rare germline variants from ClinVar, with an area under the receiver operating characteristic curve (AUROC) of 0.4481–0.8033, but poor for rare somatic variants from COSMIC (AUROC = 0.4984–0.7131), common regulatory variants from curated eQTL data (AUROC = 0.4837–0.6472), and disease-associated common variants from curated GWAS (AUROC = 0.4766–0.5188). We also compared the prediction performance of the 24 methods for non-coding de novo mutations in autism spectrum disorder, and found that the combined annotation-dependent depletion (CADD) and context-dependent tolerance score (CDTS) methods performed better than the others. In summary, we assessed the performance of 24 computational methods under diverse scenarios, providing preliminary advice for proper tool selection and guidance for the development of new techniques for interpreting non-coding variants.
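
To make the AUROC ranges quoted above easier to read, the following is a minimal sketch of how an AUROC can be computed from a method's scores and binary benchmark labels. It uses the standard rank-based (Mann-Whitney U) formulation and is not the authors' evaluation pipeline; the function name `auroc` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen positive
    variant scores higher than a randomly chosen negative one (ties count 0.5).

    Assumption: labels are 1 (positive, e.g. pathogenic) or 0 (negative).
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores the average rank of their group.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: four variants scored by one method.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs are ordered correctly
```

Under this definition a value near 0.5 corresponds to random ranking, which is why the ranges reported for the eQTL and GWAS benchmarks indicate little discriminative power.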

Keywords