MolData, a molecular benchmark for disease and target based machine learning

Arash Keshavarzi Arshadi; Milad Salem; Arash Firouzbakht; Jiann Shiun Yuan

doi:10.1186/s13321-022-00590-y

Journal of Cheminformatics (Mar 2022)

Affiliations

Arash Keshavarzi Arshadi: Burnett School of Biomedical Sciences, University of Central Florida
Milad Salem: Department of Electrical and Computer Engineering, University of Central Florida
Arash Firouzbakht: Department of Chemistry, University of Illinois at Urbana
Jiann Shiun Yuan: Department of Electrical and Computer Engineering, University of Central Florida

Journal volume & issue: Vol. 14, no. 1
pp. 1 – 18

Abstract

Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability both to learn abstract features and to discover complex molecular patterns directly from molecular data. Because biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions that can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date to democratize molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing across multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and to accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.

Published in Journal of Cheminformatics

ISSN: 1758-2946 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Chemistry
Website: https://jcheminf.biomedcentral.com/
