Patterns (May 2024)

Creation of a structured solar cell material dataset and performance prediction using large language models

  • Tong Xie,
  • Yuwei Wan,
  • Yufei Zhou,
  • Wei Huang,
  • Yixuan Liu,
  • Qingyuan Linghu,
  • Shaozhou Wang,
  • Chunyu Kit,
  • Clara Grazian,
  • Wenjie Zhang,
  • Bram Hoex

Journal volume & issue
Vol. 5, no. 5
p. 100955

Abstract


Summary: Materials scientists routinely collect experimental data to distill empirical knowledge and predict improved materials. However, a crucial issue is how to efficiently use unstructured data to update existing structured datasets, particularly in applied disciplines. This study introduces a new natural language processing (NLP) task, structured information inference (SII), to address this problem. We propose an end-to-end approach that summarizes and organizes multi-layered, device-level information from the literature into structured data. After comparing different methods, we fine-tuned LLaMA, reaching an F1 score of 87.14%, to update an existing perovskite solar cell dataset with articles published since its release, making the dataset directly usable in subsequent data analysis. Using the structured information, we developed regression tasks to predict the electrical performance of solar cells. Our results demonstrate performance comparable to traditional machine-learning methods without feature selection and highlight the potential of large language models for scientific knowledge acquisition and materials development.

The bigger picture: Big data's importance in materials science is clear, yet its effective use is challenging because of the sheer volume and complexity of the data. Natural language processing (NLP) offers a solution by transforming unstructured text into structured formats, facilitating tasks such as extraction and summarization. In materials science, this means converting information from scientific papers into structured datasets, a process often slowed by the continuous influx of new data. To avoid the inefficiencies of multi-step NLP workflows, there is a growing need for streamlined, one-step NLP methods. Fine-tuned large language models could be key, allowing datasets to be updated rapidly and providing valuable training data for further model development. This approach not only expedites research but also accelerates material prediction, leading to faster scientific breakthroughs.
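The SII task described above pairs raw paper text with a flat structured record so that a causal language model such as LLaMA can learn the text-to-record mapping end to end. The following is a minimal sketch of how one such instruction-tuning example might be assembled; the prompt wording, field names, and JSON schema are illustrative assumptions, not the paper's exact setup.

    # Minimal sketch of a structured information inference (SII) training
    # example: unstructured paper text paired with a target structured
    # record. Field names and schema are illustrative assumptions.
    import json

    def build_sii_example(paper_text: str, record: dict) -> dict:
        """Pair raw text with its structured record as an
        instruction-tuning example for a causal LM (e.g., LLaMA)."""
        prompt = (
            "Summarize the perovskite solar cell described below as a "
            "structured record.\n\n"
            f"### Text:\n{paper_text}\n\n### Record:\n"
        )
        return {"prompt": prompt, "completion": json.dumps(record)}

    # Toy example with hypothetical device-level fields.
    example = build_sii_example(
        "The FAPbI3 device with a spiro-OMeTAD HTL reached a PCE of 23.1%...",
        {
            "absorber": "FAPbI3",               # perovskite composition
            "hole_transport_layer": "spiro-OMeTAD",
            "electron_transport_layer": None,   # absent from the text
            "pce_percent": 23.1,                # power conversion efficiency
        },
    )
    print(example["prompt"] + example["completion"])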
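For the downstream regression task, a model prompted with a device description generates a numeric performance value that can be scored against measured data. The sketch below illustrates only the evaluation side, parsing a number out of generated text and computing an error metric; the output formats, parsing step, and choice of RMSE are assumptions for illustration rather than the paper's reported procedure.

    # Hedged sketch of scoring LLM-generated performance predictions.
    # All outputs and measured values below are illustrative.
    import math
    import re

    def parse_prediction(generated: str) -> float:
        """Extract the first numeric value from the model's generated text."""
        match = re.search(r"[-+]?\d*\.?\d+", generated)
        if match is None:
            raise ValueError(f"no number in model output: {generated!r}")
        return float(match.group())

    def rmse(y_true, y_pred):
        """Root-mean-square error between measured and predicted values."""
        return math.sqrt(
            sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
        )

    outputs = ["PCE: 21.4 %", "approximately 18.9", "23.05"]
    measured = [21.0, 19.2, 22.8]
    predicted = [parse_prediction(o) for o in outputs]
    print(f"RMSE = {rmse(measured, predicted):.2f} percentage points")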

Keywords