Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature

Sarveswara Rao Vangala; Sowmya Ramaswamy Krishnan; Navneet Bung; Dhandapani Nandagopal; Gomathi Ramasamy; Satyam Kumar; Sridharan Sankaran; Rajgopal Srinivasan; Arijit Roy

doi:10.1186/s13321-024-00928-8

Journal of Cheminformatics (Nov 2024)

Affiliations

Sarveswara Rao Vangala: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Sowmya Ramaswamy Krishnan: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Navneet Bung: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Dhandapani Nandagopal: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Gomathi Ramasamy: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Satyam Kumar: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Sridharan Sankaran: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Rajgopal Srinivasan: TCS Research (Life Sciences Division), Tata Consultancy Services Limited
Arijit Roy: TCS Research (Life Sciences Division), Tata Consultancy Services Limited

DOI: https://doi.org/10.1186/s13321-024-00928-8
Journal volume & issue: Vol. 16, no. 1
pp. 1 – 13

Abstract

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.

Scientific contribution

In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.

Published in Journal of Cheminformatics

ISSN: 1758-2946 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Chemistry
Website: https://jcheminf.biomedcentral.com/