Integrated approach to generate artificial samples with low tumor fraction for somatic variant calling benchmarking

Aldo Sergi; Luca Beltrame; Sergio Marchini; Marco Masseroli

doi:10.1186/s12859-024-05793-8

BMC Bioinformatics (May 2024)

Affiliations

Aldo Sergi: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Luca Beltrame: IRCCS Humanitas Research Hospital
Sergio Marchini: IRCCS Humanitas Research Hospital
Marco Masseroli: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano

DOI: https://doi.org/10.1186/s12859-024-05793-8
Journal volume & issue: Vol. 25, no. 1, pp. 1–20

Abstract

Background: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions because of contamination from normal cells or tumor heterogeneity, which poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing the sensitivity and detection performance of HTS approaches in such cases is paramount, but time-consuming and expensive: it requires specialized experimental protocols and a sufficient number of samples for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed to generate artificial datasets suitable for this task. It simulates ultra-deep targeted sequencing data with low-fraction variants, and we demonstrate the effectiveness of these datasets in benchmarking low-fraction variant calling.

Results: Our approach generates artificial raw reads that mimic real data without relying on pre-existing data. It does so by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. It then incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely used variant calling algorithms. They allowed us to define tuned parameter values for major variant callers, considerably improving their detection of very low-fraction variants.

Conclusions: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, which facilitates rapid prototyping and benchmarking of algorithms for this type of dataset, and the important need to advance low-fraction variant calling techniques.
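
As an illustration of what using an artificial dataset as ground truth can look like in practice, the following minimal Python sketch compares a caller's output with the set of spiked-in variants and reports recall, precision, and F1. It is not the authors' evaluation code: the `Variant` type, the `benchmark` function, the exact-match criterion on (chromosome, position, ref, alt), and the toy variants are all assumptions made for this example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant key; real VCF records carry more fields."""
    chrom: str
    pos: int
    ref: str
    alt: str


def benchmark(truth: set[Variant], called: set[Variant]) -> dict[str, float]:
    """Score called variants against a ground-truth set by exact match."""
    tp = len(truth & called)   # spiked-in variants that were detected
    fp = len(called - truth)   # calls with no matching spiked-in variant
    fn = len(truth - called)   # spiked-in variants that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "recall": recall, "precision": precision, "f1": f1}


# Toy data (assumed, not taken from the paper): four spiked-in variants,
# of which the caller recovers two and adds one spurious call.
truth = {
    Variant("chr1", 100, "A", "T"),
    Variant("chr1", 200, "G", "C"),
    Variant("chr2", 300, "C", "A"),
    Variant("chr2", 400, "T", "G"),
}
called = {
    Variant("chr1", 100, "A", "T"),
    Variant("chr1", 200, "G", "C"),
    Variant("chr3", 500, "G", "A"),
}

metrics = benchmark(truth, called)
print(metrics)
```

Running the same comparison before and after changing a caller's parameters is one simple way to quantify whether tuning improves detection of low-fraction variants. Exact matching is a simplification here, since indels may need normalization before comparison.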

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
