ClusterTAD: an unsupervised machine learning approach to detecting topologically associated domains of chromosomes from Hi-C data

Oluwatosin Oluwadare; Jianlin Cheng

doi:10.1186/s12859-017-1931-2

BMC Bioinformatics (Nov 2017)

Affiliations

Oluwatosin Oluwadare: Electrical Engineering and Computer Science Department, University of Missouri
Jianlin Cheng: Electrical Engineering and Computer Science Department, University of Missouri

DOI: https://doi.org/10.1186/s12859-017-1931-2
Journal volume & issue: Vol. 18, no. 1, pp. 1–14

Abstract

Background: With the development of chromosomal conformation capture techniques, particularly the Hi-C technique, the study of the spatial conformation of a genome is becoming an important topic in bioinformatics and computational biology. The Hi-C technique can generate genome-wide chromosomal interaction (contact) data, which can be used to investigate the higher-level organization of chromosomes, such as Topologically Associated Domains (TADs), i.e., locally packed chromosome regions bound together by intra-chromosomal contacts. The identification of the TADs of a genome is useful for studying gene regulation, genomic interaction, and genome function.

Results: Here, we formulate the TAD identification problem as an unsupervised machine learning (clustering) problem, and develop a new TAD identification method called ClusterTAD. We introduce a novel method to represent chromosomal contacts as features to be used by the clustering algorithm. Our results show that ClusterTAD can accurately predict the TADs on simulated Hi-C data. Our method is also largely complementary to and consistent with existing methods on the real Hi-C datasets of two mouse cells. Validation with chromatin immunoprecipitation sequencing (ChIP-Seq) data shows that the domain boundaries identified by ClusterTAD are highly enriched in CTCF binding sites, promoter-related marks, and enhancer-related histone modifications.

Conclusions: As ClusterTAD is based on a proven clustering approach, it opens a new avenue for applying the large array of clustering methods developed in the machine learning field to the TAD identification problem. The source code, the results, and the TADs generated for the simulated and real Hi-C datasets are available at https://github.com/BDM-Lab/ClusterTAD.
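The idea of treating TAD identification as a clustering problem can be illustrated with a small sketch. The abstract does not say which features or which clustering algorithm ClusterTAD uses, so everything here is an assumption made for illustration: each bin's row of the contact matrix serves as its feature vector, a plain k-means with evenly spaced seeds does the clustering, and the number of clusters is supplied by hand. The function names (`toy_contact_matrix`, `kmeans_rows`, `labels_to_domains`) are hypothetical and are not taken from the ClusterTAD repository.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def toy_contact_matrix(block_sizes: List[int],
                       inside: float = 10.0,
                       outside: float = 1.0) -> Matrix:
    """Build a synthetic contact matrix with dense diagonal blocks (toy TADs)."""
    block_of_bin = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    return [[inside if bi == bj else outside for bj in block_of_bin]
            for bi in block_of_bin]


def kmeans_rows(rows: Matrix, k: int, iters: int = 50) -> List[int]:
    """Lloyd's k-means over row vectors (assumed feature encoding: one row per bin)."""
    n = len(rows)
    # Deterministic seeding: pick k rows spread evenly along the chromosome.
    centroids = [list(rows[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    labels = [-1] * n
    for _ in range(iters):
        # Assign each bin to the nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Recompute each centroid as the mean of its member rows.
        for c in range(k):
            members = [rows[i] for i in range(n) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def labels_to_domains(labels: List[int], min_size: int = 2) -> List[Tuple[int, int]]:
    """Turn runs of consecutive bins with the same label into (start, end) domains."""
    domains = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start >= min_size:
                domains.append((start, i - 1))  # end index is inclusive
            start = i
    return domains


contacts = toy_contact_matrix([3, 4, 2])
domains = labels_to_domains(kmeans_rows(contacts, k=3))
print(domains)  # [(0, 2), (3, 6), (7, 8)]
```

On this noise-free toy matrix the three dense blocks are recovered exactly. Real Hi-C data would at least need normalization and a principled way to choose the number of clusters, neither of which this sketch attempts.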

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
