Scientific Data (Jan 2023)

SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials

  • Peter Eastman,
  • Pavan Kumar Behara,
  • David L. Dotson,
  • Raimondas Galvelis,
  • John E. Herr,
  • Josh T. Horton,
  • Yuezhi Mao,
  • John D. Chodera,
  • Benjamin P. Pritchard,
  • Yuanqing Wang,
  • Gianni De Fabritiis,
  • Thomas E. Markland

DOI
https://doi.org/10.1038/s41597-022-01882-6
Journal volume & issue
Vol. 10, no. 1
pp. 1 – 11

Abstract

Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high-quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. The dataset can serve as a valuable resource for the creation of transferable, ready-to-use potential functions for molecular simulations.
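
The abstract does not define "chemical accuracy"; a common convention, assumed here, is a mean absolute error of about 1 kcal/mol against the reference energies. The following minimal sketch shows how such a check could be computed. The function names, the choice of Hartree as the input unit, and the sample energies are all hypothetical illustrations and are not taken from the SPICE paper or its data files.

```python
# Minimal sketch: compare predicted and reference energies against an assumed
# chemical-accuracy threshold. All numbers below are made up, not SPICE data.

HARTREE_TO_KCAL_PER_MOL = 627.5095    # standard unit conversion factor
CHEMICAL_ACCURACY_KCAL_PER_MOL = 1.0  # common convention, assumed here


def mean_absolute_error_kcal(predicted_hartree, reference_hartree):
    """Return the mean absolute energy error in kcal/mol (inputs in Hartree)."""
    if not reference_hartree or len(predicted_hartree) != len(reference_hartree):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted_hartree, reference_hartree))
    return total / len(reference_hartree) * HARTREE_TO_KCAL_PER_MOL


def within_chemical_accuracy(predicted_hartree, reference_hartree):
    """True if the mean absolute error is at or below the assumed threshold."""
    mae = mean_absolute_error_kcal(predicted_hartree, reference_hartree)
    return mae <= CHEMICAL_ACCURACY_KCAL_PER_MOL


if __name__ == "__main__":
    # Hypothetical model predictions and reference energies, in Hartree.
    predicted = [-1.0000, -2.0010]
    reference = [-1.0005, -2.0000]
    print(mean_absolute_error_kcal(predicted, reference))
    print(within_chemical_accuracy(predicted, reference))
```

In practice the same comparison is often reported per atom or separately for forces, so the threshold and units should be matched to whatever convention a given benchmark states.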