Communications Biology (Feb 2024)

On the stability of canonical correlation analysis and partial least squares with application to brain-behavior associations

  • Markus Helmer,
  • Shaun Warrington,
  • Ali-Reza Mohammadi-Nejad,
  • Jie Lisa Ji,
  • Amber Howell,
  • Benjamin Rosand,
  • Alan Anticevic,
  • Stamatios N. Sotiropoulos,
  • John D. Murray

DOI
https://doi.org/10.1038/s42003-024-05869-4
Journal volume & issue
Vol. 7, no. 1, pp. 1–15

Abstract

Associations between datasets can be discovered through multivariate methods such as Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). A requisite property for the interpretability and generalizability of CCA/PLS associations is the stability of their feature patterns. However, empirical characterizations have called the stability of CCA/PLS in high-dimensional datasets into question. To study these issues systematically, we developed a generative modeling framework to simulate synthetic datasets. We found that when the sample size is relatively small, but comparable to that of typical studies, CCA/PLS associations are highly unstable and inaccurate, both in their magnitude and, importantly, in the feature pattern underlying the association. We confirmed these trends across two neuroimaging modalities and in independent datasets with n ≈ 1000 and n = 20,000, and found that only the latter comprised sufficient observations for stable mappings between imaging-derived and behavioral features. We further developed a power calculator that provides the sample sizes required for stability and reliability of multivariate analyses. Collectively, we characterize how to limit the detrimental effects of overfitting on CCA/PLS stability, and we provide recommendations for future studies.
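
The overfitting effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' generative modeling framework or their power calculator; it is a minimal NumPy example under assumed settings (20 features per dataset, identity covariance within each dataset, a single true canonical correlation of 0.3, and the hypothetical helper names `simulate` and `first_canonical_correlation`). It estimates the first in-sample canonical correlation at a small and a large sample size so the two can be compared with the assumed true value.

```python
import numpy as np

TRUE_R = 0.3  # assumed population canonical correlation (illustrative value)


def simulate(n, px, py, true_r, rng):
    """Draw n samples of X (n x px) and Y (n x py).

    Toy assumption: all features are independent standard normals, except
    that the first feature of Y correlates with the first feature of X at
    true_r. The population first canonical correlation is then true_r.
    """
    X = rng.standard_normal((n, px))
    Y = rng.standard_normal((n, py))
    Y[:, 0] = true_r * X[:, 0] + np.sqrt(1.0 - true_r**2) * Y[:, 0]
    return X, Y


def first_canonical_correlation(X, Y):
    """In-sample first canonical correlation (requires n > px and n > py).

    The canonical correlations are the singular values of Qx^T Qy, where
    Qx and Qy are orthonormal bases of the centered column spaces.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    singular_values = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(singular_values[0])


rng = np.random.default_rng(0)
results = {}
for n in (50, 5000):
    X, Y = simulate(n, 20, 20, TRUE_R, rng)
    results[n] = first_canonical_correlation(X, Y)
    print(f"n = {n:5d}: estimated r = {results[n]:.3f} (assumed true r = {TRUE_R})")
```

In this toy setting the small-sample estimate is inflated well above the assumed true value, while the large-sample estimate lies close to it. The sample sizes and feature counts here are arbitrary and are not taken from the paper.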