Patterns (May 2024)
Creation of a structured solar cell material dataset and performance prediction using large language models
Abstract
Summary: Materials scientists typically collect experimental data to consolidate what has been learned and to predict improved materials. However, a crucial issue is how to use unstructured data efficiently to update existing structured data, particularly in applied disciplines. This study introduces a new natural language processing (NLP) task, structured information inference (SII), to address this problem. We propose an end-to-end approach that summarizes and organizes the multi-layered, device-level information in the literature into structured data. After comparing different methods, we fine-tuned LLaMA, which reached an F1 score of 87.14%, and used it to update an existing perovskite solar cell dataset with articles published since its release, so that the updated dataset can be used directly in subsequent data analysis. Using the structured information, we set up regression tasks to predict the electrical performance of solar cells. Our results show performance comparable to that of traditional machine-learning methods without feature selection and highlight the potential of large language models for scientific knowledge acquisition and material development.
The bigger picture: Big data’s importance in materials science is clear, yet its effective use is challenging because of the sheer volume and complexity of the data. Natural language processing (NLP) offers a solution by transforming unstructured text into structured formats, facilitating tasks such as extraction and summarization. In materials science, this means converting information from scientific papers into structured datasets, a process often slowed by the continuous influx of new data. To avoid the inefficiencies of multi-step NLP workflows, there is a growing need for streamlined, one-step NLP methods. Fine-tuned large language models could be key, allowing datasets to be updated rapidly and providing valuable training data for further model development. This approach not only expedites research but also accelerates material prediction, leading to faster scientific breakthroughs.
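The F1 score quoted in the summary measures how well the inferred structured records match reference records. The abstract does not state how matches are counted, so the following is only a minimal sketch under one assumption: each record is a flat mapping from field name to value, and a prediction counts as correct only when both field and value match exactly. The function name `field_f1`, the field names, and the two example records are hypothetical and chosen purely for illustration; they are not taken from the paper or its dataset.

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """F1 over exact (field, value) pairs of two flat records.

    Assumption: exact matching of field/value pairs; the paper's actual
    scoring scheme may differ.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_pos = len(pred_pairs & gold_pairs)
    if true_pos == 0:
        # No overlap (or an empty record): precision and recall are both 0.
        return 0.0
    precision = true_pos / len(pred_pairs)
    recall = true_pos / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical device-stack records, for illustration only.
gold = {
    "absorber": "MAPbI3",
    "etl": "TiO2",
    "htl": "Spiro-OMeTAD",
    "back_contact": "Au",
}
predicted = {
    "absorber": "MAPbI3",
    "etl": "SnO2",  # one field inferred incorrectly
    "htl": "Spiro-OMeTAD",
    "back_contact": "Au",
}

score = field_f1(predicted, gold)
print(f"F1 = {score:.2f}")
```

With three of four fields matching, precision and recall are both 0.75, so the sketch reports an F1 of 0.75 for this single record. A dataset-level score would aggregate such counts over all records.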