Accounting for diverse feature-types improves patient stratification on tabular clinical datasets

Saptarshi Bej; Chaithra Umesh; Manjunath Mahendra; Kristian Schultz; Jit Sarkar; Olaf Wolkenhauer

Machine Learning with Applications (Dec 2023)

Affiliations

Saptarshi Bej (corresponding author): Indian Institute of Science Education and Research, Thiruvananthapuram, India
Chaithra Umesh: Department of Systems Biology and Bioinformatics, University of Rostock, Germany
Manjunath Mahendra: Department of Systems Biology and Bioinformatics, University of Rostock, Germany
Kristian Schultz: Department of Systems Biology and Bioinformatics, University of Rostock, Germany
Jit Sarkar: Tumour Immunology Group, Comprehensive Cancer Centre, King’s College London, UK
Olaf Wolkenhauer: Department of Systems Biology and Bioinformatics, University of Rostock, Germany; Leibniz-Institute for Food Systems Biology, Technical University Munich, Germany; Stellenbosch Institute for Advanced Study, South Africa

Journal volume & issue: Vol. 14, p. 100490

Abstract

Tabular Clinical and Biomedical Routine Data (CBRD) contain diverse feature types. Recent research shows that the conventional application of Uniform Manifold Approximation and Projection (UMAP) to extract clusters from a low-dimensional embedding can prove ineffective because of the diverse feature types in such datasets. The Feature-type Distributed Clustering (FDC) workflow accounts for these feature types, resulting in a more informative low-dimensional embedding. However, a rigorous assessment of the FDC algorithm has been missing so far. In this work, we conducted comprehensive benchmarking experiments to compare the quality of the cluster distributions and low-dimensional embeddings generated by FDC against those generated by UMAP, using standard objective measures: Silhouette score, Dunn index, and ANOVA. Our results confirm that FDC can indeed be the better choice for embedding tabular data with diverse feature types in low dimensions and for extracting clusters from such an embedding. In addition, we provide a rationale for the choice of metrics proposed in the FDC workflow. We also point out some problems with the original Canberra metric used to reduce ordinal features in the FDC workflow and provide a solution in the form of a modified Canberra metric. Using seven datasets from the medical domain for benchmarking, we demonstrate that FDC leads to improved patient stratification.
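As a rough illustration of two of the quantities named in the abstract, the sketch below computes the standard Canberra distance and the Dunn index in plain Python. It is not the authors' implementation: the function names `canberra` and `dunn_index` are hypothetical, the Canberra function is the textbook form (the modified version proposed in the paper is not specified here), and the Dunn index uses one common definition (minimum between-cluster point distance divided by maximum within-cluster diameter).

```python
from itertools import combinations
from math import dist


def canberra(u, v):
    """Textbook Canberra distance; terms with a zero denominator are skipped."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def dunn_index(points, labels, metric=dist):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two points that share a cluster.
    max_diameter = max(
        (metric(p, q) for pts in clusters.values() for p, q in combinations(pts, 2)),
        default=0.0,
    )
    # Smallest distance between two points from different clusters.
    min_separation = min(
        metric(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    if max_diameter == 0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter
```

Higher Dunn index values indicate compact, well-separated clusters, which is why such a measure can be used to compare clusterings obtained from different embeddings. Passing `metric=canberra` evaluates the same clustering under the Canberra distance instead of the Euclidean default.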

Published in Machine Learning with Applications

ISSN: 2666-8270 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/machine-learning-with-applications
