CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis

Olga Permiakova; Romain Guibert; Alexandra Kraut; Thomas Fortin; Anne-Marie Hesse; Thomas Burger

doi:10.1186/s12859-021-03969-0

BMC Bioinformatics (Feb 2021)

Affiliations

Olga Permiakova: Univ. Grenoble Alpes, CEA, Inserm, BGE U1038
Romain Guibert: Univ. Grenoble Alpes, CEA, Inserm, BGE U1038
Alexandra Kraut: Univ. Grenoble Alpes, CEA, Inserm, BGE U1038
Thomas Fortin: Univ. Grenoble Alpes, CEA, Inserm, BGE U1038
Anne-Marie Hesse: Univ. Grenoble Alpes, CEA, Inserm, BGE U1038
Thomas Burger: Univ. Grenoble Alpes, CNRS, CEA, Inserm, BGE U1038

Journal volume & issue: Vol. 22, no. 1, pp. 1–30

Abstract

Background: The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most state-of-the-art methods rely on single-pass linkage-based algorithms.

Results: We propose a clustering algorithm that optimizes the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, the Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles.

Conclusions: Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters that differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of differently principled algorithms for extracting the best from complex LC-MS data.
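
To illustrate how an optimal transport distance can act as a similarity measure between elution profiles, here is a minimal sketch. It is not taken from the article: the function names, the uniform retention-time grid, the `gamma` parameter and the exponential (Laplacian-style) form of the kernel are all assumptions made for illustration, and the kernels actually defined by the authors may differ. The sketch uses the fact that, in one dimension, the Wasserstein-1 distance between two normalized profiles equals the area between their cumulative distributions.

```python
import math


def normalize(profile):
    """Scale a non-negative intensity profile so that it sums to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have positive total intensity")
    return [x / total for x in profile]


def wasserstein1(p, q, step=1.0):
    """Wasserstein-1 distance between two profiles sampled on the same
    uniform grid (assumed spacing `step`), computed as the area between
    their cumulative distributions."""
    if len(p) != len(q):
        raise ValueError("profiles must be sampled on the same grid")
    p, q = normalize(p), normalize(q)
    cdf_gap = 0.0   # running difference between the two cumulative sums
    distance = 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b
        distance += abs(cdf_gap) * step
    return distance


def wasserstein_kernel(p, q, gamma=1.0, step=1.0):
    """Hypothetical similarity: exp(-gamma * W1). Equals 1 for identical
    profiles and decays as one profile is shifted away from the other."""
    return math.exp(-gamma * wasserstein1(p, q, step))


# Two single-peak profiles, shifted by one and by two grid points.
peak = [0.0, 1.0, 0.0, 0.0]
shift_one = [0.0, 0.0, 1.0, 0.0]
shift_two = [0.0, 0.0, 0.0, 1.0]
print(wasserstein1(peak, shift_one), wasserstein1(peak, shift_two))
```

Under these assumptions the distance grows with the size of the retention-time shift, so the kernel value drops smoothly as two peaks move apart, whereas a pointwise comparison would treat any two non-overlapping peaks as equally dissimilar.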

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
