Looking through glass: Knowledge discovery from materials science literature using natural language processing

Vineeth Venugopal; Sourav Sahoo; Mohd Zaki; Manish Agarwal; Nitya Nand Gosvami; N. M. Anoop Krishnan

Patterns (Jul 2021)

Affiliations

Vineeth Venugopal: Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; Corresponding author
Sourav Sahoo: Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
Mohd Zaki: Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
Manish Agarwal: Computer Services Center, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
Nitya Nand Gosvami: Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
N. M. Anoop Krishnan: Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; Corresponding author

Journal volume & issue: Vol. 2, no. 7
p. 100290

Abstract

Summary: Most of the knowledge in materials science literature is in the form of unstructured data such as text and images. Here, we present a framework employing natural language processing, which automates text and image comprehension and precision knowledge extraction from the literature on inorganic glasses. The abstracts are automatically categorized using latent Dirichlet allocation (LDA) to classify and search semantically linked publications. Similarly, a comprehensive summary of images and plots is presented using the caption cluster plot (CCP), providing direct access to images buried in the papers. Finally, we combine the LDA and CCP with chemical elements to present an elemental map, a topical and image-wise distribution of elements occurring in the literature. Overall, the framework presented here can be a generic and powerful tool to extract and disseminate material-specific information on composition–structure–processing–property dataspaces, allowing insights into fundamental problems relevant to the materials science community and accelerated materials discovery.

The bigger picture: Most knowledge generated through scientific enquiry in the materials domain is presented in the form of unstructured data. Among the available sources such as online websites, digital data, and publications, peer-reviewed journals serve as the undisputed source of reliable information regarding materials synthesis, characterization, and properties. Despite the availability of large amounts of data, only a limited fraction is compiled in the form of machine-readable databases, most of which are manually curated. Here, applying natural language processing to a large corpus of journal publications on inorganic glasses, we present a framework for information extraction from text and images, which answers queries related to synthesis and characterization techniques, and even the chemical elements used. The scalable approach presented here can be applied to other domains for efficient information retrieval from scientific literature.
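
As a rough illustration of the elemental map idea described in the summary, the following Python sketch counts, for each topic, how many texts mention each chemical element. It is not the authors' implementation: the topic labels stand in for the output of a topic model such as LDA, and the element table, the sample captions, and the function names `find_elements` and `elemental_map` are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny element table; a real run would cover
# the full periodic table.
NAMES = {
    "silicon": "Si", "sodium": "Na", "oxygen": "O",
    "germanium": "Ge", "selenium": "Se", "boron": "B",
}
SYMBOLS = set(NAMES.values())

# A token counts as a formula if it is made only of element-like symbols
# with optional digits, e.g. "SiO2" or "Na2O". Ordinary words such as
# "Sodium" or "Raman" do not match. (Assumption: symbols that are also
# English words, such as "In", would need extra disambiguation.)
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def find_elements(text):
    """Return the set of element symbols mentioned by name or in a formula."""
    found = set()
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.lower() in NAMES:
            found.add(NAMES[token.lower()])
        elif FORMULA.fullmatch(token):
            for symbol in re.findall(r"[A-Z][a-z]?", token):
                if symbol in SYMBOLS:
                    found.add(symbol)
    return found


def elemental_map(labelled_texts):
    """Count, per topic, the number of texts mentioning each element."""
    counts = defaultdict(Counter)
    for topic, text in labelled_texts:
        counts[topic].update(find_elements(text))
    return counts


# Made-up captions with made-up topic labels, for illustration only.
DOCS = [
    ("topic_0", "Raman spectra of sodium silicate glass, Na2O-SiO2"),
    ("topic_0", "Density of SiO2 glass versus temperature"),
    ("topic_1", "Transmission of Ge-Se chalcogenide glass"),
]

emap = elemental_map(DOCS)
for topic in sorted(emap):
    print(topic, sorted(emap[topic].items()))
```

Each text contributes at most one count per element, so the result reads as a document frequency per topic rather than a raw mention count.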

Published in Patterns

ISSN: 2666-3899 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.cell.com/patterns
