Materials information extraction via automatically generated corpus

Rongen Yan; Xue Jiang; Weiren Wang; Depeng Dang; Yanjing Su

doi:10.1038/s41597-022-01492-2

Scientific Data (Jul 2022)

Affiliations

Rongen Yan: School of Artificial Intelligence, Beijing Normal University
Xue Jiang: Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing
Weiren Wang: Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing
Depeng Dang: School of Artificial Intelligence, Beijing Normal University
Yanjing Su: Beijing Advanced Innovation Center for Materials Genome Engineering, Institute for Advanced Materials and Technology, University of Science and Technology Beijing

DOI: https://doi.org/10.1038/s41597-022-01492-2
Journal volume & issue: Vol. 9, no. 1
pp. 1 – 12

Abstract

Information Extraction (IE) in Natural Language Processing (NLP) aims to extract structured information from unstructured text to assist a computer in understanding natural language. Machine learning-based IE methods bring more intelligence and possibilities but require an extensive and accurately labeled corpus. In the materials science domain, producing reliable labels is a laborious task that requires the efforts of many professionals. To reduce manual intervention and automatically generate a materials corpus for IE, we propose a semi-supervised IE framework for materials based on an automatically generated corpus. Taking the superalloy data extraction from our previous work as an example, the proposed framework uses Snorkel to automatically label a corpus containing property values. An Ordered Neurons-Long Short-Term Memory (ON-LSTM) network is then trained on the generated corpus to obtain an information extraction model. The experimental results show that the F1-scores for the γ’ solvus temperature, density, and solidus temperature of superalloys are 83.90%, 94.02%, and 89.27%, respectively. Furthermore, similar experiments on other materials show that the proposed framework is broadly applicable in the field of materials.
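The automatic labeling step described in the abstract can be illustrated with a small sketch of weak supervision: several heuristic labeling functions vote on each sentence, and the votes are combined into one label. This is not the authors' code and does not use the Snorkel API. The labeling functions, the regular expression, and the example sentences are all hypothetical, and the simple majority vote stands in for Snorkel's label model, which estimates labeling-function accuracies rather than just counting votes.

```python
import re
from collections import Counter

# Label conventions (assumed for this sketch).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_has_temperature_unit(sentence):
    """Vote POSITIVE when a number is followed by a temperature unit."""
    pattern = r"\d+(?:\.\d+)?\s*(?:°C|K)\b"
    return POSITIVE if re.search(pattern, sentence) else ABSTAIN


def lf_mentions_solvus(sentence):
    """Vote POSITIVE when the target property name appears."""
    return POSITIVE if "solvus" in sentence.lower() else ABSTAIN


def lf_no_number(sentence):
    """Vote NEGATIVE when the sentence contains no digit at all."""
    return NEGATIVE if not re.search(r"\d", sentence) else ABSTAIN


# Hypothetical set of labeling functions for one property.
LABELING_FUNCTIONS = [lf_has_temperature_unit, lf_mentions_solvus, lf_no_number]


def majority_vote(votes):
    """Combine votes; abstain when nobody votes or the top labels tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_sentence(sentence):
    """Apply every labeling function and return the combined label."""
    return majority_vote([lf(sentence) for lf in LABELING_FUNCTIONS])
```

Sentences that receive a non-abstain label would form the automatically generated corpus on which a sequence model such as ON-LSTM is then trained; sentences on which the functions abstain or disagree are left out.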

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
