Machine extraction of polymer data from tables using XML versions of scientific articles

Hiroyuki Oka; Atsushi Yoshizawa; Hiroyuki Shindo; Yuji Matsumoto; Masashi Ishii

doi:10.1080/27660400.2021.1899456

Science and Technology of Advanced Materials: Methods (Jan 2021)

Affiliations

Hiroyuki Oka: National Institute for Materials Science (NIMS)
Atsushi Yoshizawa: National Institute for Materials Science (NIMS)
Hiroyuki Shindo: Nara Institute of Science and Technology (NAIST)
Yuji Matsumoto: RIKEN
Masashi Ishii: National Institute for Materials Science (NIMS)

Journal volume & issue: Vol. 1, no. 1
pp. 12 – 23

Abstract

In this study, we examined machine extraction of polymer data from tables in scientific articles. The extraction system consists of five processes: table extraction, data formatting, polymer name recognition, property specifier identification, and data extraction. Tables were first extracted in plain text. XML versions of scientific articles were used, and the tabular forms were accurately extracted, even for complicated tables, such as multi-column, multi-row, and merged tables. Polymer name recognition was performed using a named entity recognizer created by deep neural network learning of polymer names. The preparation cost of the training data was reduced using a rule-based algorithm. The target polymer properties in this study were glass transition temperature (Tg), melting temperature (Tm), and decomposition temperature (Td), and the specifiers were identified using partial string matching. Through these five processes, 2,181 data points for Tg, 1,526 for Tm, and 2,316 for Td were extracted from approximately 18,000 scientific articles published by Elsevier. Nearly half of them were extracted from complicated tables. The F-scores for the extraction were 0.871, 0.870, and 0.841, respectively. These results indicate that the extraction system created in this study can rapidly and accurately collect large amounts of polymer data from tables in scientific literature.
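As an illustration of two of the steps named in the abstract, the following is a minimal sketch, not the authors' implementation. It expands merged cells in an XML table into a rectangular grid, then identifies property specifiers in the header by partial string matching. The markup (`table`, `row`, `cell`, `rowspan`, `colspan`), the pattern lists, the function names, and the sample values are all hypothetical; real publisher XML uses a different schema. Polymer names are simply read from the first column here, whereas the study uses a neural named entity recognizer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table markup with made-up values.
# Real publisher XML schemas differ from this.
SAMPLE_XML = """
<table>
  <row><cell rowspan="2">Polymer</cell><cell colspan="2">Thermal properties</cell></row>
  <row><cell>Tg [degC]</cell><cell>Td,5% [degC]</cell></row>
  <row><cell>Polymer A</cell><cell>100</cell><cell>350</cell></row>
</table>
"""

# Assumed substring patterns per property; the paper's actual lists are not given here.
SPECIFIERS = {
    "Tg": ("tg", "glass transition"),
    "Tm": ("tm", "melting"),
    "Td": ("td", "decomposition"),
}


def expand_table(xml_text):
    """Return a rectangular grid, copying each merged cell into every position it spans."""
    root = ET.fromstring(xml_text.strip())
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(root.iter("row")):
        c = 0
        for cell in row.iter("cell"):
            while (r, c) in grid:  # position already filled by a rowspan from above
                c += 1
            text = (cell.text or "").strip()
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def identify_property(header):
    """Return 'Tg', 'Tm', or 'Td' if the header contains a known pattern, else None."""
    lowered = header.lower()
    for prop, patterns in SPECIFIERS.items():
        if any(p in lowered for p in patterns):
            return prop
    return None


def extract(grid, n_header_rows=2):
    """Collect (polymer, property, value) tuples, assuming names sit in column 0."""
    header = grid[n_header_rows - 1]  # lowest header row carries the specifiers
    records = []
    for row in grid[n_header_rows:]:
        for col, label in enumerate(header):
            prop = identify_property(label)
            if prop is not None:
                records.append((row[0], prop, row[col]))
    return records


if __name__ == "__main__":
    print(extract(expand_table(SAMPLE_XML)))
```

Plain substring matching of this kind is deliberately naive: a short pattern such as "tm" also matches unrelated headers like "Treatment", so a practical system would need stricter rules than this sketch shows.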

Published in Science and Technology of Advanced Materials: Methods

ISSN: 2766-0400 (Online)
Publisher: Taylor & Francis Group
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Materials of engineering and construction. Mechanics of materials
Website: https://www.tandfonline.com/journals/tstm
