SoftwareX (Sep 2024)
TD-COF: A new method for detecting tandem duplications in next generation sequencing data
Abstract
Tandem duplications significantly influence the diversity of the human genome and the occurrence of many complex diseases. However, accurate detection of tandem duplications at low coverage remains a challenging task. Detection methods based on read depth (RD) assume a linear relationship between the RD value of a genomic region and the number of tandem duplications in that region. At low coverage, however, the RD values of tandem duplication regions and normal regions in a sequencing sample do not differ significantly, which degrades detection performance. As a result, traditional statistical models built on the RD strategy often yield relatively low precision and recall. To address this problem, we propose TD-COF, a new method for identifying tandem duplications in whole-genome sequencing data that uses the COF (Connectivity-Based Outlier Factor) algorithm. Taking into account the relative connectivity between intervals in the genome, the algorithm applies a connectivity factor to each bin to calculate its outlier score. Additionally, TD-COF introduces the split-read strategy into RD-based detection, enabling the start and end points of tandem duplications to be identified at single-base resolution. Furthermore, TD-COF incorporates mapping quality as a feature signal: bins with lower mapping quality are assigned higher outlier values, which effectively mitigates interference from mapping errors. Simulation experiments demonstrate that TD-COF outperforms other methods in terms of sensitivity, precision and F1 score. In addition, TD-COF exhibits high consistency with other methods when applied to real sequencing samples. This study indicates that TD-COF is an effective method for detecting tandem duplications even in regions of low or moderate coverage.
In summary, we consider TD-COF to be an effective method for detecting tandem duplications.
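To make the connectivity-based scoring mentioned in the abstract more concrete, the following is a minimal sketch of the generic COF computation applied to per-bin features. It is not the TD-COF implementation: the function names (knn, ac_dist, cof_scores), the choice of k, the toy (read depth, mapping quality) values and the use of unscaled Euclidean distance are all assumptions made for illustration.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def ac_dist(points, i, neighbours):
    """Average chaining distance of points[i] over {i} plus its neighbours."""
    chain = [i]                      # points already linked, starting from i
    remaining = list(neighbours)
    r = len(neighbours) + 1          # size of the neighbourhood including i
    total = 0.0
    for step in range(1, r):
        # Link the remaining point that is closest to any point in the chain.
        best_j, best_d = None, math.inf
        for j in remaining:
            d = min(math.dist(points[j], points[c]) for c in chain)
            if d < best_d:
                best_j, best_d = j, d
        # Earlier links get larger weights; the weights sum to 1.
        total += 2 * (r - step) / (r * (r - 1)) * best_d
        chain.append(best_j)
        remaining.remove(best_j)
    return total


def cof_scores(points, k):
    """COF of each point: its chaining distance over the mean of its neighbours'."""
    n = len(points)
    nbrs = [knn(points, i, k) for i in range(n)]
    ac = [ac_dist(points, i, nbrs[i]) for i in range(n)]
    scores = []
    for i in range(n):
        mean_nbr = sum(ac[j] for j in nbrs[i]) / k
        if mean_nbr > 0:
            scores.append(ac[i] / mean_nbr)
        else:
            # Degenerate case: all neighbours sit on duplicate points.
            scores.append(math.inf if ac[i] > 0 else 1.0)
    return scores


# Hypothetical bins as (read depth, mean mapping quality); the last bin has
# an elevated read depth, as a duplicated region might.
bin_features = [(10.0, 60.0), (11.0, 60.0), (12.0, 60.0), (13.0, 60.0),
                (14.0, 60.0), (15.0, 60.0), (30.0, 60.0)]
scores = cof_scores(bin_features, k=2)
for features, score in zip(bin_features, scores):
    print(features, round(score, 2))
```

In this toy input the six bins with similar read depth receive scores close to 1, while the last bin scores far higher because it is only weakly connected to its neighbours. In practice the features would need to be put on comparable scales before computing distances, and a threshold on the score would still have to be chosen; neither step is shown here.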