A data science roadmap for open science organizations engaged in early-stage drug discovery

Kristina Edfeldt; Aled M. Edwards; Ola Engkvist; Judith Günther; Matthew Hartley; David G. Hulcoop; Andrew R. Leach; Brian D. Marsden; Amelie Menge; Leonie Misquitta; Susanne Müller; Dafydd R. Owen; Kristof T. Schütt; Nicholas Skelton; Andreas Steffen; Alexander Tropsha; Erik Vernet; Yanli Wang; James Wellnitz; Timothy M. Willson; Djork-Arné Clevert; Benjamin Haibe-Kains; Lovisa Holmberg Schiavone; Matthieu Schapira

doi:10.1038/s41467-024-49777-x

Nature Communications (Jul 2024)

Affiliations

Kristina Edfeldt: Structural Genomics Consortium, Department of Medicine, Karolinska University Hospital and Karolinska Institutet
Aled M. Edwards: Structural Genomics Consortium, University of Toronto
Ola Engkvist: Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden & Department of Computer Science and Engineering, Chalmers University of Technology
Judith Günther: Bayer AG Research and Development, Computational Molecular Design
Matthew Hartley: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus
David G. Hulcoop: Open Targets, Wellcome Genome Campus
Andrew R. Leach: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus
Brian D. Marsden: Centre for Medicines Discovery, NDM, University of Oxford
Amelie Menge: Institute of Pharmaceutical Chemistry, Johann Wolfgang Goethe University, Frankfurt am Main, 60438, Germany & Structural Genomics Consortium (SGC), Buchmann Institute for Life Sciences, Johann Wolfgang Goethe University
Leonie Misquitta: National Library of Medicine, National Institutes of Health
Susanne Müller: Institute of Pharmaceutical Chemistry, Johann Wolfgang Goethe University, Frankfurt am Main, 60438, Germany & Structural Genomics Consortium (SGC), Buchmann Institute for Life Sciences, Johann Wolfgang Goethe University
Dafydd R. Owen: Pfizer Worldwide Research, Development & Medical
Kristof T. Schütt: Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences
Nicholas Skelton: Department of Discovery Chemistry, Genentech, Inc.
Andreas Steffen: Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences
Alexander Tropsha: Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina
Erik Vernet: Digital Science & Innovation, Novo Nordisk A/S
Yanli Wang: National Library of Medicine, National Institutes of Health
James Wellnitz: Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina
Timothy M. Willson: Structural Genomics Consortium, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill
Djork-Arné Clevert: Pfizer, Worldwide Research, Development and Medical, Machine Learning & Computational Sciences
Benjamin Haibe-Kains: Structural Genomics Consortium, University of Toronto
Lovisa Holmberg Schiavone: Discovery Biology, Discovery Sciences, R&D, AstraZeneca
Matthieu Schapira: Structural Genomics Consortium, University of Toronto

DOI: https://doi.org/10.1038/s41467-024-49777-x
Journal volume & issue: Vol. 15, no. 1, pp. 1–10

Abstract

The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. Like many others, we believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question, then, is how best to benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
