Computational and Structural Biotechnology Journal (Jan 2025)

Informational rescaling of PCA maps with application to genetic distance

  • Nassim Nicholas Taleb,
  • Pierre Zalloua,
  • Khaled Elbassioni,
  • Haralampos Hatzikirou,
  • Andreas Henschel,
  • Daniel E. Platt

Journal volume & issue
Vol. 27
pp. 48 – 56

Abstract

Principal Component Analysis (PCA) is a powerful multivariate tool for projecting data onto low-dimensional representations. However, distances between data points in these low-dimensional projections are difficult to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI yields a representation of relative statistical associations when, as in genetics, it is applied to measurements, in bits, of the mutual information between individuals' genomes. This entropy-rescaled PCA, while preserving order relationships (along a dimension), expresses relative distances in information units, such as “bits”. We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted.
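
To make the idea of measuring association in bits concrete, the following minimal sketch uses the standard closed-form mutual information of two jointly Gaussian variables with correlation rho, namely I = -1/2 log2(1 - rho^2). It is an illustration only and should not be read as the authors' rescaling procedure, which the abstract does not spell out; the function name `gaussian_mi_bits` and the sample correlations are assumptions introduced here.

```python
import math


def gaussian_mi_bits(rho: float) -> float:
    """Mutual information, in bits, between two jointly Gaussian
    variables with Pearson correlation rho (hypothetical helper)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Closed form for the bivariate Gaussian case: I = -1/2 * log2(1 - rho^2)
    return -0.5 * math.log2(1.0 - rho * rho)


if __name__ == "__main__":
    # Illustrative correlations only; not values taken from the paper.
    for rho in (0.1, 0.5, 0.9):
        print(f"rho={rho:.1f}  MI={gaussian_mi_bits(rho):.4f} bits")
```

Under this Gaussian assumption the mapping is monotone in |rho|, so the ordering of associations is kept, but it is strongly nonlinear: equal steps in correlation do not correspond to equal steps in information.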

Keywords