Frontiers in Genetics (Jun 2023)
OCRFinder: a noise-tolerance machine learning method for accurately estimating open chromatin regions
Abstract
Open chromatin regions are the genomic regions associated with basic cellular physiological activities, while chromatin accessibility is reported to affect gene expressions and functions. A basic computational problem is to efficiently estimate open chromatin regions, which could facilitate both genomic and epigenetic studies. Currently, ATAC-seq and cfDNA-seq (plasma cell-free DNA sequencing) are two popular strategies to detect OCRs. As cfDNA-seq can obtain more biomarkers in one round of sequencing, it is considered more effective and convenient. However, in processing cfDNA-seq data, due to the dynamically variable chromatin accessibility, it is quite difficult to obtain the training data with pure OCRs or non-OCRs, and leads to a noise problem for either feature-based approaches or learning-based approaches. In this paper, we propose a learning-based OCR estimation approach with a noise-tolerance design. The proposed approach, named OCRFinder, incorporates the ideas of ensemble learning framework and semi-supervised strategy to avoid potential overfitting of noisy labels, which are the false positives on OCRs and non-OCRs. Compared to different noise control strategies and state-of-the-art approaches, OCRFinder achieved higher accuracies and sensitivities in the experiments. In addition, OCRFinder also has an excellent performance in ATAC-seq or DNase-seq comparison experiments.
Keywords