Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Jörg Rahnenführer; Riccardo De Bin; Axel Benner; Federico Ambrogi; Lara Lusa; Anne-Laure Boulesteix; Eugenia Migliavacca; Harald Binder; Stefan Michiels; Willi Sauerbrei; Lisa McShane; for topic group “High-dimensional data” (TG9) of the STRATOS initiative

doi:10.1186/s12916-023-02858-y

BMC Medicine (May 2023)

Affiliations

Jörg Rahnenführer: Department of Statistics, TU Dortmund University
Riccardo De Bin: Department of Mathematics, University of Oslo
Axel Benner: Division of Biostatistics, German Cancer Research Center (DKFZ)
Federico Ambrogi: Department of Clinical Sciences and Community Health, University of Milan
Lara Lusa: Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska
Anne-Laure Boulesteix: Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich
Eugenia Migliavacca: Nestle Research, EPFL Innovation Park
Harald Binder: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg
Stefan Michiels: Service de Biostatistique et d’Épidémiologie, Gustave Roussy, Université Paris-Saclay
Willi Sauerbrei: Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg
Lisa McShane: Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute

DOI: https://doi.org/10.1186/s12916-023-02858-y
Journal volume & issue: Vol. 21, no. 1, pp. 1–54

Abstract

Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.

Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.

Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.

Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.

Published in BMC Medicine

ISSN: 1741-7015 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine
Website: http://bmcmedicine.biomedcentral.com
