An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study

Xinxin Zhang; Jimmy Lee; Wilson Wen Bin Goh

Heliyon (May 2022)

Affiliations

Xinxin Zhang: School of Biological Sciences, Nanyang Technological University, 637551, Singapore
Jimmy Lee: North Region & Department of Psychosis, Institute of Mental Health, 539747, Singapore; Corresponding author.
Wilson Wen Bin Goh: School of Biological Sciences, Nanyang Technological University, 637551, Singapore; Lee Kong Chian School of Medicine, Nanyang Technological University, 636921, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, 636921, Singapore; Corresponding author.

Journal volume & issue: Vol. 8, no. 5
p. e09502

Abstract

Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, how upstream data processing methods (e.g., normalisation) affect downstream analyses is less well understood. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), a normalisation technique developed in-house, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming the other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype of interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivety. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms. This helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps to explain the model and provides directions for further improvement.
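For readers unfamiliar with two of the baseline methods named in the abstract, the following is a minimal sketch of global versus class-specific quantile normalisation. It is not the authors' implementation, and it does not attempt GFS or surrogate variable analysis. It assumes a features-by-samples matrix, breaks ties by position rather than averaging them, and uses hypothetical function names (`quantile_normalise`, `class_specific_quantile_normalise`) chosen for illustration only.

```python
import numpy as np


def quantile_normalise(x):
    """Global quantile normalisation of a features-by-samples matrix.

    Every column (sample) is forced onto the same reference distribution:
    the mean of the sorted columns at each rank.
    """
    x = np.asarray(x, dtype=float)
    # 0-based rank of each value within its column (ties broken by position).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Reference distribution: row-wise mean of the column-sorted matrix.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


def class_specific_quantile_normalise(x, labels):
    """Quantile-normalise the samples of each class separately.

    Assumption: `labels` holds one class label per column of `x`.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    out = np.empty_like(x)
    for cls in np.unique(labels):
        cols = labels == cls  # boolean mask selecting this class's samples
        out[:, cols] = quantile_normalise(x[:, cols])
    return out


if __name__ == "__main__":
    # Toy data: 4 features, 4 samples, two classes with different scales.
    data = np.array([
        [5.0, 4.0, 3.0, 10.0],
        [2.0, 1.0, 4.0, 20.0],
        [3.0, 4.0, 6.0, 30.0],
        [4.0, 2.0, 8.0, 40.0],
    ])
    classes = ["a", "a", "b", "b"]
    print(quantile_normalise(data))
    print(class_specific_quantile_normalise(data, classes))
```

In the global variant all samples end up with one shared distribution, which can erase genuine between-class differences. The class-specific variant preserves those differences but requires class labels at normalisation time.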

Published in Heliyon

ISSN: 2405-8440 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General); Social Sciences: Social sciences (General)
Website: https://www.cell.com/heliyon/home
