EXSCLAIM!: Harnessing materials science literature for self-labeled microscopy datasets

Eric Schwenker; Weixin Jiang; Trevor Spreadbury; Nicola Ferrier; Oliver Cossairt; Maria K.Y. Chan

Patterns (Nov 2023)

Affiliations

Eric Schwenker: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Materials Science and Engineering, Northwestern University, Evanston, IL 60208, USA; Corresponding author
Weixin Jiang: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Trevor Spreadbury: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Nicola Ferrier: Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL 60439, USA
Oliver Cossairt: Department of Computer Science, Northwestern University, Evanston, IL 60208, USA
Maria K.Y. Chan: Center for Nanoscale Materials, Argonne National Laboratory, Argonne, IL 60439, USA; Corresponding author

Journal volume & issue: Vol. 4, no. 11
p. 100843

Abstract

Summary: This work introduces the EXSCLAIM! toolkit for the automatic extraction, separation, and caption-based natural language annotation of images from scientific literature. EXSCLAIM! is used to show how rule-based natural language processing and image recognition can be leveraged to construct an electron microscopy dataset containing thousands of keyword-annotated nanostructure images. Moreover, it is demonstrated how a combination of statistical topic modeling and semantic word similarity comparisons can be used to increase the number and variety of keyword annotations on top of the standard annotations from EXSCLAIM! With large-scale imaging datasets constructed from scientific literature, users are well positioned to train neural networks for classification and recognition tasks specific to microscopy, tasks often otherwise inhibited by a lack of sufficient annotated training data.

The bigger picture: Due to recent improvements in image resolution and acquisition speed, materials microscopy is experiencing an explosion of published imaging data. The standard publication format, while sufficient for data ingestion scenarios where a selection of images can be critically examined and curated manually, is not conducive to large-scale data aggregation or analysis, hindering data sharing and reuse. Most images in publications are part of a larger figure, with their explicit context buried in the main body or caption text; so even if aggregated, collections of images with weak or no digitized contextual labels have limited value. The tool developed in this work establishes a scalable pipeline for meaningful image-/language-based information curation from scientific literature.

DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem

Published in Patterns

ISSN: 2666-3899 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.cell.com/patterns
