Scientific Data (Jun 2024)

Folding the human proteome using BioNeMo: A fused dataset of structural models for machine learning purposes

  • Michael Hetmann,
  • Lena Parigger,
  • Hassan Sirelkhatim,
  • Abraham Stern,
  • Andreas Krassnigg,
  • Karl Gruber,
  • Georg Steinkellner,
  • David Ruau,
  • Christian C. Gruber

DOI
https://doi.org/10.1038/s41597-024-03403-z
Journal volume & issue
Vol. 11, no. 1
pp. 1 – 13

Abstract

Human proteins are crucial players in both health and disease. Understanding their molecular landscape is a central topic in biological research. Here, we present an extensive dataset of predicted protein structures for 42,042 distinct human proteins, including splicing variants, derived from the UniProt reference proteome UP000005640. To ensure high quality and comparability, the dataset was generated by combining the state-of-the-art modeling tools AlphaFold 2, OpenFold, and ESMFold, provided within NVIDIA’s BioNeMo platform, as well as homology modeling using Innophore’s CavitomiX platform. Our dataset is offered in both unedited and edited formats for diverse research requirements. The unedited version contains structures as generated by the different prediction methods, whereas the edited version contains refinements, including a dataset of structures without low prediction-confidence regions and structures in complex with predicted ligands based on homologs in the PDB. We are confident that this dataset represents the most comprehensive collection of human protein structures available today, facilitating diverse applications such as structure-based drug design and the prediction of protein function and interactions.
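
As an illustration of what removing low prediction-confidence regions can look like in practice, the following is a minimal sketch and not the authors' pipeline. It assumes a PDB-format model in which the B-factor column holds the per-residue pLDDT score on a 0-100 scale, as AlphaFold 2 output does. The function name `strip_low_confidence` and the cutoff of 70 are hypothetical choices made for this example; the abstract does not state the threshold used for the dataset.

```python
def strip_low_confidence(pdb_text: str, min_plddt: float = 70.0) -> str:
    """Drop ATOM records whose B-factor field is below min_plddt.

    Assumption: the B-factor column stores pLDDT on a 0-100 scale.
    The default cutoff of 70 is illustrative only.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB is a fixed-width format: the B-factor occupies columns 61-66,
            # which is the slice [60:66] with zero-based indexing.
            plddt = float(line[60:66])
            if plddt < min_plddt:
                continue  # skip atoms in low-confidence regions
        # Non-ATOM records (HEADER, TER, END, ...) are passed through unchanged.
        kept.append(line)
    return "\n".join(kept)
```

Calling `strip_low_confidence(open("model.pdb").read())` on a hypothetical file `model.pdb` would return the same structure with low-scoring atoms removed. Because pLDDT is assigned per residue, all atoms of a residue share one value, so filtering atom by atom keeps or drops whole residues under the stated assumption.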