RUBic: rapid unsupervised biclustering

Brijesh K. Sriwastava; Anup Kumar Halder; Subhadip Basu; Tapabrata Chakraborti

doi:10.1186/s12859-023-05534-3

BMC Bioinformatics (Nov 2023)


Affiliations

Brijesh K. Sriwastava: Computer Science and Engineering Department, Government College of Engineering and Leather Technology
Anup Kumar Halder: Faculty of Mathematics and Information Sciences, Warsaw University of Technology
Subhadip Basu: Department of Computer Science and Engineering, Jadavpur University
Tapabrata Chakraborti: The Alan Turing Institute and University College London

DOI: https://doi.org/10.1186/s12859-023-05534-3
Journal volume & issue: Vol. 24, no. 1, pp. 1–16

Abstract

Biclustering of biologically meaningful binary information is essential in many applications related to drug discovery, such as protein–protein interactions and gene expression. However, for robust performance on recently emerging large health datasets, new biclustering algorithms must be scalable and fast. We present a rapid unsupervised biclustering (RUBic) algorithm that achieves this objective with a novel encoding and search strategy. RUBic significantly reduces computational overhead on both synthetic and experimental datasets relative to several state-of-the-art biclustering algorithms. On 100 synthetic binary datasets, our method took ∼71.1 s to extract 494,872 biclusters. On the human PPI database of size 4085 × 4085, our method generates 1840 biclusters in ∼48.6 s. On a central nervous system embryonic tumor gene expression dataset of size 712,940, our algorithm takes 101 min to produce 747,069 biclusters, while recent competing algorithms take significantly more time to produce the same result. RUBic is also evaluated on five different gene expression datasets and shows a significant speed-up in execution time over existing approaches when extracting significant KEGG-enriched biclusters. RUBic can operate in two modes, base and flex: base mode generates maximal biclusters, whereas flex mode generates fewer biclusters, and does so faster, based on their biological significance with respect to KEGG pathways. The code is available at https://github.com/CMATERJU-BIOINFO/RUBic for academic use only.
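To make the notion of a maximal bicluster in binary data concrete, the following minimal sketch enumerates all-ones submatrices that cannot be extended by any further row or column. It is a brute-force illustration only, exponential in the number of rows, and is not RUBic's encoding or search strategy, which the abstract does not describe. The function name `maximal_biclusters`, its parameters, and the toy matrix are hypothetical.

```python
from itertools import combinations


def maximal_biclusters(matrix, min_rows=2, min_cols=2):
    """Brute-force enumeration of maximal all-ones biclusters.

    Illustrative only: this is NOT the RUBic algorithm, and it is
    practical only for very small binary matrices.
    """
    n_rows = len(matrix)
    # Represent each row by the set of column indices that hold a 1.
    row_sets = [frozenset(j for j, v in enumerate(row) if v) for row in matrix]
    found = set()
    for k in range(max(1, min_rows), n_rows + 1):
        for rows in combinations(range(n_rows), k):
            # Columns shared by every row in this subset.
            cols = frozenset.intersection(*(row_sets[i] for i in rows))
            if len(cols) < min_cols:
                continue
            # Extend to every row that contains all shared columns, so the
            # resulting (rows, cols) pair cannot be grown any further.
            closed_rows = frozenset(
                i for i, s in enumerate(row_sets) if cols <= s
            )
            found.add((closed_rows, cols))
    return found


# Hypothetical 4 x 4 binary matrix (rows could be genes, columns samples).
example = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = maximal_biclusters(example)
for rows, cols in sorted(biclusters, key=lambda b: (sorted(b[0]), sorted(b[1]))):
    print(sorted(rows), sorted(cols))
```

Each printed pair lists the row indices and column indices of one bicluster. Because every subset of rows is inspected, the cost grows exponentially with matrix size, which is the scalability problem that dedicated biclustering algorithms aim to avoid.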

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
