A compound-target pairs dataset: differences between drugs, clinical candidates and other bioactive compounds

A. Lina Heinzke; Barbara Zdrazil; Paul D. Leeson; Robert J. Young; Axel Pahl; Herbert Waldmann; Andrew R. Leach

doi:10.1038/s41597-024-03582-9

Scientific Data (Oct 2024)

Affiliations

A. Lina Heinzke: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus
Barbara Zdrazil: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus
Paul D. Leeson: Paul Leeson Consulting Ltd
Robert J. Young: Blue Burgundy Ltd
Axel Pahl: Compound Management and Screening Center, Max-Planck-Institute of Molecular Physiology
Herbert Waldmann: Department of Chemical Biology, Max-Planck-Institute of Molecular Physiology
Andrew R. Leach: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus

DOI: https://doi.org/10.1038/s41597-024-03582-9
Journal volume & issue: Vol. 11, no. 1
pp. 1–9

Abstract

Providing a better understanding of what makes a compound a successful drug candidate is crucial for reducing the high attrition rates in drug discovery. Analyses of the differences between active compounds, clinical candidates and drugs require high-quality datasets. However, most datasets of drug discovery programs are not openly available. This work introduces a dataset of compound-target pairs extracted from the open-source bioactivity database ChEMBL (release 32). Compound-target pairs in the dataset either have at least one measured activity or are part of the manually curated set of known interactions in ChEMBL. Known interactions between drugs or clinical candidates and targets are specifically annotated to facilitate analyses of differences between drugs, clinical candidates, and other active compounds. In total, the dataset comprises 614,594 compound-target pairs, 5,109 (3,932) of which are known interactions between drugs (clinical candidates) and targets. The extraction is performed in an automated manner and fully reproducible. We are providing not only the datasets but also the code to rerun the analyses with other ChEMBL releases.
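
The inclusion rule stated in the abstract (a pair is kept if it has at least one measured activity or belongs to the curated known interactions) and the annotation of known interactions can be illustrated with a minimal sketch. This is not the authors' extraction code. The `Pair` record, its field names, and the functions `is_included`, `annotate` and `build_dataset` are hypothetical, and the mapping from a maximum clinical phase to the labels "drug" and "clinical candidate" is a simplifying assumption made here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one compound-target pair (not the dataset schema)."""
    compound_id: str
    target_id: str
    has_measured_activity: bool        # at least one measured activity
    is_known_interaction: bool         # listed in the curated known interactions
    max_phase: Optional[float] = None  # assumed: 4 = approved, lower positive = in trials


def is_included(pair: Pair) -> bool:
    # Inclusion rule from the abstract: measured activity OR known interaction.
    return pair.has_measured_activity or pair.is_known_interaction


def annotate(pair: Pair) -> str:
    # Only known interactions receive a drug / clinical-candidate label.
    if not pair.is_known_interaction or pair.max_phase is None:
        return "other"
    if pair.max_phase >= 4:   # assumption: phase 4 means approved drug
        return "drug"
    if pair.max_phase > 0:    # assumption: any lower positive phase
        return "clinical candidate"
    return "other"


def build_dataset(pairs: List[Pair]) -> List[Tuple[str, str, str]]:
    # Keep the included pairs and attach the annotation to each.
    return [
        (p.compound_id, p.target_id, annotate(p))
        for p in pairs
        if is_included(p)
    ]


if __name__ == "__main__":
    # Toy identifiers, invented for this example.
    toy = [
        Pair("C1", "T1", True, True, 4),
        Pair("C2", "T1", False, True, 2),
        Pair("C3", "T2", True, False),
        Pair("C4", "T2", False, False),
    ]
    for row in build_dataset(toy):
        print(row)
```

In this sketch a compound with an approved status still yields the label "other" for a target when the pair is not among the known interactions, which mirrors the abstract's distinction between known interactions and other measured activities.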

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
