Opportunities and challenges of text mining in materials research
Olga Kononova,
Tanjin He,
Haoyan Huo,
Amalie Trewartha,
Elsa A. Olivetti,
Gerbrand Ceder
Affiliations
Olga Kononova
Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Tanjin He
Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Haoyan Huo
Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Amalie Trewartha
Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Elsa A. Olivetti
Department of Materials Science & Engineering, MIT, Cambridge, MA 02139, USA
Gerbrand Ceder
Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA; Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Corresponding author
Summary: Research publications are the major repository of scientific knowledge. However, their unstructured and highly heterogeneous format creates a significant obstacle to large-scale analysis of the information they contain. Recent progress in natural language processing (NLP) has provided a variety of tools for high-quality information extraction from unstructured text. These tools are primarily trained on non-technical text and struggle to produce accurate results when applied to scientific text, which involves specialized technical terminology. In recent years, significant efforts in information retrieval have been made for biomedical and biochemical publications. For materials science, text mining (TM) methodology is still at the dawn of its development. In this review, we survey recent progress in creating and applying TM and NLP approaches in the materials science field. This review is directed at the broad class of researchers aiming to learn the fundamentals of TM as applied to materials science publications.