Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

Aditi S. Krishnapriyan; Joseph Montoya; Maciej Haranczyk; Jens Hummelshøj; Dmitriy Morozov

doi:10.1038/s41598-021-88027-8

Scientific Reports (Apr 2021)

Affiliations

Aditi S. Krishnapriyan: Computational Research Division, Lawrence Berkeley National Laboratory
Joseph Montoya: Toyota Research Institute
Maciej Haranczyk: IMDEA Materials Institute
Jens Hummelshøj: Toyota Research Institute
Dmitriy Morozov: Computational Research Division, Lawrence Berkeley National Laboratory

DOI: https://doi.org/10.1038/s41598-021-88027-8
Journal volume & issue: Vol. 11, no. 1
pp. 1 – 11

Abstract

Machine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material’s structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal–organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly used, manually curated features, consistently achieving an average 25–30% decrease in root-mean-squared deviation and an average increase of 40–50% in R² scores. A key advantage of our approach is interpretability: our model identifies the pores that correlate best with adsorption at different pressures, which contributes to understanding atomic-level structure–property relationships for materials design.
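To give a feel for the persistent homology mentioned in the abstract, the following is a minimal sketch and not the authors' pipeline. It computes only the simplest case, zero-dimensional persistence (connected components) of a small point cloud, by merging components in order of increasing pairwise distance. The pore features described in the abstract correspond to higher-dimensional homology, which would normally be computed with a dedicated topology library. The function name `zero_dim_persistence` and the toy coordinates are assumptions made for illustration.

```python
import itertools
import math


def zero_dim_persistence(points):
    """Return (birth, death) pairs for connected components of a point cloud.

    Illustrative assumption: every point is born at scale 0, and a component
    dies at the length of the edge that first merges it into another one.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one of them dies
            deaths.append(length)

    # One component survives at every scale, so its death is infinite.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]


# Toy "atoms" on a line: two close together, one far away.
diagram = zero_dim_persistence([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)])
print(diagram)
```

Long-lived pairs in such a diagram indicate well-separated structure, while short-lived pairs indicate features that vanish at small length scales. Descriptors built from diagrams like this are the kind of input a regression model could use alongside chemical embeddings.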

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
