Science and Technology of Advanced Materials: Methods (Dec 2022)
Effects of data bias on machine-learning–based material discovery using experimental property data
Abstract
Materials informatics (MI), the discovery of new materials through machine learning (ML) applied to large-scale material data, has attracted considerable attention in recent years. However, the large-scale material data used in MI are generally biased owing to differences in the targeted material domains, and most MI studies have not clearly demonstrated how this bias influences ML models. In this study, we clarify the influence of data bias on ML models by combining the concept of the applicability domain with clustering, applied to large-scale experimental property data in the Starrydata2 material database previously developed by our group. The results show that data bias influences both the error and the reliability of the predictions made by the ML model: predictions within the applicability domain are more reliable than those made outside it. This indicates that the material space in which the constructed ML model can reliably discover new materials is limited. Nonetheless, when we apply the ML model to a large dataset comprising various material classes, we find that new materials similar to known materials can be proposed within this limited space. Thus, our findings demonstrate the importance of considering data bias when constructing and evaluating ML models in MI.
Keywords