Effects of data bias on machine-learning–based material discovery using experimental property data

Masaya Kumagai; Yuki Ando; Atsumi Tanaka; Koji Tsuda; Yukari Katsura; Ken Kurosaki

doi:10.1080/27660400.2022.2109447

Science and Technology of Advanced Materials: Methods (Dec 2022)

Affiliations

Masaya Kumagai: Kyoto University
Yuki Ando: National Institute for Materials Science (NIMS)
Atsumi Tanaka: The University of Tokyo
Koji Tsuda: RIKEN
Yukari Katsura: RIKEN
Ken Kurosaki: Kyoto University

DOI: https://doi.org/10.1080/27660400.2022.2109447
Journal volume & issue: Vol. 2, no. 1
pp. 302 – 309

Abstract

Materials informatics (MI), the discovery of new materials through machine learning (ML) on large-scale material data, has attracted considerable attention in recent years. However, the large-scale material data used in MI are generally biased because the material domains they target differ. Moreover, most MI studies have not clearly demonstrated how data bias influences ML models. In this study, we clarify that influence by combining the concept of the applicability domain with clustering, applied to large-scale experimental property data in Starrydata2, a material database previously developed by our group. The results show that data bias influences both the error and the reliability of the ML model's predictions. Predictions within the applicability domain are highly reliable compared with those made outside it. This indicates that the material space in which the constructed ML model can reliably discover materials is limited. Nonetheless, when we apply the ML model to a large dataset comprising various material classes, we find that it can propose new materials similar to known materials within this limited space. Thus, our findings demonstrate the importance of considering data bias when constructing and evaluating ML models in MI.
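
The abstract does not say how the applicability domain is computed, so the following is only an illustrative sketch of one common formulation, not the authors' method. It assumes a distance-based domain: a query counts as in-domain when its mean distance to its k nearest training points does not exceed a threshold derived from the training data itself. The function names, the choice of k, the quantile, and the toy two-dimensional descriptors are all hypothetical.

```python
import math


def mean_knn_distance(point, reference, k):
    """Mean Euclidean distance from `point` to its k nearest points in `reference`."""
    distances = sorted(math.dist(point, other) for other in reference)
    return sum(distances[:k]) / k


def fit_domain_threshold(train, k=2, quantile=0.95):
    """Leave-one-out k-NN distance of each training point; return the chosen quantile."""
    scores = []
    for i, point in enumerate(train):
        others = train[:i] + train[i + 1:]  # exclude the point itself
        scores.append(mean_knn_distance(point, others, k))
    scores.sort()
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def in_domain(query, train, threshold, k=2):
    """True if the query lies as close to the training data as the training data lies to itself."""
    return mean_knn_distance(query, train, k) <= threshold


# Toy descriptors (assumed): a tight cluster of "known materials".
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
threshold = fit_domain_threshold(train, k=2)

near = in_domain((0.04, 0.06), train, threshold)  # candidate close to the cluster
far = in_domain((5.0, 5.0), train, threshold)     # candidate far from any known data
print(near, far)
```

Under this assumption, a prediction for the far candidate would be flagged as outside the domain and treated as less reliable, which mirrors the abstract's point that the model can only be trusted in a limited region of material space.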

Published in Science and Technology of Advanced Materials: Methods

ISSN: 2766-0400 (Online)
Publisher: Taylor & Francis Group
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Materials of engineering and construction. Mechanics of materials
Website: https://www.tandfonline.com/journals/tstm
