Nature Communications (Feb 2024)

Structured information extraction from scientific text with large language models

  • John Dagdelen,
  • Alexander Dunn,
  • Sanghoon Lee,
  • Nicholas Walker,
  • Andrew S. Rosen,
  • Gerbrand Ceder,
  • Kristin A. Persson,
  • Anubhav Jain

DOI
https://doi.org/10.1038/s41467-024-45563-x
Journal volume & issue
Vol. 15, no. 1
pp. 1–14

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
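
To make the "list of JSON objects" output format concrete, the following is a minimal sketch of how a downstream script might read such a completion for the dopant and host linking task. Everything in it is an assumption made for illustration: the completion string is invented, the keys `host` and `dopants` are a hypothetical schema rather than the one used in the paper, and the helpers `parse_records` and `host_dopant_pairs` are not part of any published code.

```python
import json

# Hypothetical model completion, invented for illustration (not taken from the paper).
# The "host" and "dopants" keys are an assumed schema, not the authors' schema.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, '
    '{"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion expected to hold a JSON list of objects.

    Returns an empty list if the text is not valid JSON or is not a list,
    since a generative model is not guaranteed to emit well-formed output.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only entries that are JSON objects (Python dicts).
    return [record for record in data if isinstance(record, dict)]


def host_dopant_pairs(records):
    """Flatten records into (host, dopant) tuples, one per linked pair."""
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((host, dopant))
    return pairs


records = parse_records(completion)
pairs = host_dopant_pairs(records)
print(pairs)
```

Returning an empty list on malformed output is only one possible design choice; a real pipeline might instead log the failure or retry the extraction.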