PatCID: an open-access dataset of chemical structures in patent documents

Lucas Morin; Valéry Weber; Gerhard Ingmar Meijer; Fisher Yu; Peter W. J. Staar

doi:10.1038/s41467-024-50779-y

Nature Communications (Aug 2024)

Affiliations

Lucas Morin: IBM Research
Valéry Weber: IBM Research
Gerhard Ingmar Meijer: IBM Research
Fisher Yu: Department of Information Technology and Electrical Engineering, ETH Zürich
Peter W. J. Staar: IBM Research

DOI: https://doi.org/10.1038/s41467-024-50779-y
Journal volume & issue: Vol. 15, no. 1, pp. 1–11

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and material science. Within patent documents, crucial information often resides in visual depictions of molecular structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to such information at scale. It enables users to search for which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). By leveraging state-of-the-art document-understanding methods, PatCID's high-quality data outperforms currently available automatically generated patent databases. PatCID even competes with proprietary, manually created patent databases. This enables promising applications in automatic literature review and learning-based molecular generation. The dataset is freely accessible for download.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
