A rule-free workflow for the automated generation of databases from scientific literature

Luke P. J. Gilligan; Matteo Cobelli; Valentin Taufour; Stefano Sanvito

doi:10.1038/s41524-023-01171-9

npj Computational Materials (Dec 2023)

Affiliations

Luke P. J. Gilligan: School of Physics, AMBER and CRANN Institute, Trinity College
Matteo Cobelli: School of Physics, AMBER and CRANN Institute, Trinity College
Valentin Taufour: Department of Physics and Astronomy, University of California
Stefano Sanvito: School of Physics, AMBER and CRANN Institute, Trinity College

DOI: https://doi.org/10.1038/s41524-023-01171-9
Journal volume & issue: Vol. 9, no. 1, pp. 1–14

Abstract

In recent times, transformer networks have achieved state-of-the-art performance in a wide range of natural language processing tasks. Here we present a workflow based on the fine-tuning of BERT models for different downstream tasks, which results in the automated extraction of structured information from unstructured natural language in scientific literature. Contrary to existing methods for the automated extraction of structured compound-property relations from similar sources, our workflow does not rely on the definition of intricate grammar rules. Hence, it can be adapted to a new task without requiring extensive implementation efforts and knowledge. We test our data-extraction workflow by automatically generating a database for Curie temperatures and one for band gaps. These are then compared with manually curated datasets and with those obtained with a state-of-the-art rule-based method. Furthermore, in order to showcase the practical utility of the automatically extracted data in a material-design workflow, we employ them to construct machine-learning models to predict Curie temperatures and band gaps. In general, we find that, although more noisy, automatically extracted datasets can grow fast in volume and that such volume partially compensates for the inaccuracy in downstream tasks.

Published in npj Computational Materials

ISSN: 2057-3960 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Materials of engineering and construction. Mechanics of materials; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.nature.com/npjcompumats/