BMC Bioinformatics (Mar 2024)
A clustering procedure for three-way RNA sequencing data using data transformations and matrix-variate Gaussian mixture models
Abstract
RNA sequencing of time-course experiments results in three-way count data where the dimensions are the genes, the time points and the biological units. Clustering RNA-seq data makes it possible to extract groups of genes that are co-expressed over time. After standardisation, the normalised counts of individual genes across time points and biological units have properties similar to those of compositional data. We propose the following procedure to suitably cluster three-way RNA-seq data: (1) pre-process the RNA-seq data by calculating the normalised expression profiles, (2) transform the data using the additive log ratio transform to map the composition in the D-part Aitchison simplex to a $$D-1$$-dimensional Euclidean vector, (3) cluster the transformed RNA-seq data using matrix-variate Gaussian mixture models and (4) assess the quality of the overall cluster solution and of individual clusters based on cluster separation in the transformed space using density-based silhouette information and on compactness of the clusters in the original space using cluster maps as a suitable visualisation. The proposed procedure is illustrated on RNA-seq data from fission yeast, and the results are compared to an analogous two-way approach obtained after flattening out the biological units.
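Steps (1) and (2) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the counts are stored as an array of shape (genes, time points, biological units), a profile is obtained by dividing each gene's counts by their sum over time points within a biological unit, a hypothetical pseudocount of 0.5 is added to avoid zeros, and the last time point serves as the reference part of the additive log ratio transform. The function names `normalised_profiles` and `alr` are illustrative only.

```python
import numpy as np


def normalised_profiles(counts, pseudocount=0.5):
    """Turn counts of shape (genes, time points, units) into compositions.

    Assumption: each gene's counts are divided by their sum over the time
    axis, separately for every biological unit. The pseudocount is a
    hypothetical choice that keeps all parts strictly positive.
    """
    shifted = np.asarray(counts, dtype=float) + pseudocount
    return shifted / shifted.sum(axis=1, keepdims=True)


def alr(compositions, axis=1):
    """Additive log ratio transform along `axis`.

    Maps a D-part composition to D-1 log ratios, using the last part
    along `axis` as the reference (an assumption of this sketch).
    """
    comp = np.moveaxis(np.asarray(compositions, dtype=float), axis, -1)
    log_ratios = np.log(comp[..., :-1] / comp[..., -1:])
    return np.moveaxis(log_ratios, -1, axis)


# Simulated toy data: 4 genes, 5 time points, 3 biological units.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(4, 5, 3))

profiles = normalised_profiles(counts)
transformed = alr(profiles)
print(transformed.shape)  # (4, 4, 3)
```

Each gene is then represented by a matrix with one fewer row than there are time points and one column per biological unit, which is the kind of observation a matrix-variate Gaussian mixture model takes as input in step (3).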
Keywords