Reducing training data needs with minimal multilevel machine learning (M3L)

Stefan Heinen; Danish Khan; Guido Falk von Rudorff; Konstantin Karandashev; Daniel Jose Arismendi Arrieta; Alastair J A Price; Surajit Nandi; Arghya Bhowmik; Kersti Hermansson; O Anatole von Lilienfeld

doi:10.1088/2632-2153/ad4ae5

Machine Learning: Science and Technology (Jan 2024)

Affiliations

Stefan Heinen: Vector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada
Danish Khan: Vector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada; Department of Chemistry, University of Toronto, St. George Campus, Toronto, ON, Canada
Guido Falk von Rudorff: Department of Chemistry, University Kassel, Heinrich-Plett-Str.40, 34132 Kassel, Germany; Center for Interdisciplinary Nanostructure Science and Technology (CINSaT), Heinrich-Plett-Straße 40, 34132 Kassel, Germany
Konstantin Karandashev: University of Vienna, Faculty of Physics, Kolingasse 14–16, AT-1090 Wien, Austria
Daniel Jose Arismendi Arrieta: Department of Chemistry-Ångström Laboratory, Uppsala University, Box 538, SE-75121 Uppsala, Sweden
Alastair J A Price: Department of Chemistry, University of Toronto, St. George Campus, Toronto, ON, Canada; Acceleration Consortium, University of Toronto, 80 St George St, Toronto, ON M5S 3H6, Canada
Surajit Nandi: Department of Energy Conversion and Storage, DTU, Anker Engelunds Vej, DK-2800 Kgs. Lyngby, Denmark
Arghya Bhowmik: Department of Energy Conversion and Storage, DTU, Anker Engelunds Vej, DK-2800 Kgs. Lyngby, Denmark
Kersti Hermansson: Department of Chemistry-Ångström Laboratory, Uppsala University, Box 538, SE-75121 Uppsala, Sweden
O Anatole von Lilienfeld: Vector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada; Department of Chemistry, University of Toronto, St. George Campus, Toronto, ON, Canada; Acceleration Consortium, University of Toronto, 80 St George St, Toronto, ON M5S 3H6, Canada; Department of Materials Science and Engineering, University of Toronto, St. George campus, Toronto, ON, Canada; Department of Physics, University of Toronto, St. George campus, Toronto, ON, Canada; Machine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data, Berlin, Germany

DOI: https://doi.org/10.1088/2632-2153/ad4ae5
Journal volume & issue: Vol. 5, no. 2
p. 025058

Abstract

For many machine learning applications in science, data acquisition, not training, is the bottleneck even when avoiding experiments and relying on computation and simulation. Correspondingly, and in order to reduce cost and carbon footprint, training data efficiency is key. We introduce minimal multilevel machine learning (M3L), which optimizes training data set sizes using a loss function at multiple levels of reference data in order to minimize a combination of prediction error with overall training data acquisition costs (as measured by computational wall-times). Numerical evidence has been obtained for calculated atomization energies and electron affinities of thousands of organic molecules at various levels of theory including HF, MP2, DLPNO-CCSD(T), DFHFCABS, PNOMP2F12, and PNOCCSD(T)F12, and treating them with basis sets TZ, cc-pVTZ, and AVTZ-F12. Our M3L benchmarks for reaching chemical accuracy in distinct chemical compound sub-spaces indicate substantial computational cost reductions by factors of ∼1.01, 1.1, 3.8, 13.8, and 25.8 when compared to heuristic sub-optimal multilevel machine learning (M2L) for the data sets QM7b, QM9$^\mathrm{LCCSD(T)}$, Electrolyte Genome Project, QM9$^\mathrm{CCSD(T)}_\mathrm{AE}$, and QM9$^\mathrm{CCSD(T)}_\mathrm{EA}$, respectively. Furthermore, we use M2L to investigate the performance of 76 density functionals when used within multilevel learning and building on the following levels drawn from the hierarchy of Jacob's Ladder: LDA, GGA, mGGA, and hybrid functionals. Within M2L and the molecules considered, mGGAs do not provide any noticeable advantage over GGAs. Among the functionals considered and in combination with LDA, the three on average top performing GGA and Hybrid levels for atomization energies on QM9 using M3L correspond respectively to PW91, KT2, B97D, and τ-HCTH, B3LYP$\ast$(VWN5), and TPSSH.
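
The idea of trading prediction error against data acquisition cost across levels can be illustrated with a toy sketch. It is not the authors' implementation. It assumes, purely for illustration, that each level contributes a power-law learning-curve error, that these errors add, and that cost grows linearly with the number of training points per level. The function names, the trade-off weight, and all numbers are hypothetical.

```python
import itertools


def prediction_error(sizes, prefactors, exponents):
    """Assumed error model: one power-law learning curve a * N**(-b) per level, summed."""
    return sum(a * n ** (-b) for n, a, b in zip(sizes, prefactors, exponents))


def acquisition_cost(sizes, unit_costs):
    """Assumed cost model: wall-time per training point times number of points, summed."""
    return sum(n * c for n, c in zip(sizes, unit_costs))


def combined_loss(sizes, prefactors, exponents, unit_costs, weight):
    """Hypothetical loss: prediction error plus a weighted acquisition cost."""
    error = prediction_error(sizes, prefactors, exponents)
    return error + weight * acquisition_cost(sizes, unit_costs)


def optimize_sizes(candidates, prefactors, exponents, unit_costs, weight):
    """Exhaustive search over per-level training set sizes for the lowest combined loss."""
    n_levels = len(unit_costs)
    return min(
        itertools.product(candidates, repeat=n_levels),
        key=lambda s: combined_loss(s, prefactors, exponents, unit_costs, weight),
    )


# Made-up parameters for three levels, ordered from cheapest to most expensive.
PREFACTORS = (20.0, 10.0, 5.0)   # hypothetical learning-curve offsets
EXPONENTS = (0.5, 0.5, 0.5)      # hypothetical learning-curve slopes
UNIT_COSTS = (1.0, 10.0, 100.0)  # hypothetical wall-time per training point
WEIGHT = 1e-4                    # hypothetical error-versus-cost trade-off
CANDIDATES = [2 ** k for k in range(4, 13)]  # 16 ... 4096 training points

# A fixed-ratio choice of sizes, standing in for a heuristic multilevel setup.
HEURISTIC_SIZES = (4096, 1024, 256)

best_sizes = optimize_sizes(CANDIDATES, PREFACTORS, EXPONENTS, UNIT_COSTS, WEIGHT)
print("optimized sizes per level:", best_sizes)
print("optimized loss:", combined_loss(best_sizes, PREFACTORS, EXPONENTS, UNIT_COSTS, WEIGHT))
print("heuristic loss:", combined_loss(HEURISTIC_SIZES, PREFACTORS, EXPONENTS, UNIT_COSTS, WEIGHT))
```

Because the fixed-ratio sizes are among the candidates searched, the optimized loss in this sketch can never exceed the heuristic one. How large the gap is depends entirely on the assumed learning curves and costs.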

Published in Machine Learning: Science and Technology

ISSN: 2632-2153 (Online)
Publisher: IOP Publishing
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://iopscience.iop.org/journal/2632-2153
