Data in Brief (Jun 2022)

A new ChEMBL dataset for the similarity-based target fishing engine FastTargetPred: Annotation of an exhaustive list of linear tetrapeptides

  • Shivalika Tanwar,
  • Patrick Auberger,
  • Germain Gillet,
  • Mario DiPaola,
  • Katya Tsaioun,
  • Bruno O. Villoutreix

Journal volume & issue
Vol. 42
p. 108159

Abstract

Drug discovery often requires the identification of off-targets, because the binding of a compound to targets other than the intended one(s) can be beneficial in some cases and detrimental in others (e.g., binding to anti-targets). Such investigations are also important during the early stages of a project, for example when the target is not known (e.g., phenotypic screening). Target identification can be performed in vitro, but various in silico methods have also been developed in recent years to facilitate target identification and help generate ideas. FastTargetPred is one such approach: a freely available Python/C program that attempts to predict putative macromolecular targets (i.e., target fishing) for a single small-molecule query or an entire compound collection using established chemical similarity search approaches. The putative macromolecular target(s) of a small chemical compound can be predicted by identifying ligands that are experimentally known to bind to given targets and that are structurally similar to the query compound. This type of target fishing therefore relies on a large collection of experimentally validated macromolecule-compound binding data. The small chemical compounds can be described by molecular fingerprints, which encode their structural characteristics as vectors. The published version of FastTargetPred used ligand-target binding data extracted from release 25 (2019) of the ChEMBL database. Here we provide a new dataset for FastTargetPred extracted from the latest ChEMBL release at the time of writing, ChEMBL29 (2021). Four fingerprints (ECFP4, ECFP6, MACCS and PL) were computed for the extracted compound dataset (714,780 unique ChEMBL29 compounds, whereas the entire ChEMBL29 database contained about 2.1 million compounds). However, fingerprints could not be computed for 19 molecules because of their unusual chemistry (complex macrocycles).
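To make the similarity-based target fishing principle concrete, the following is a minimal sketch and not the FastTargetPred implementation. It assumes that fingerprints are represented as sets of "on" bit indices and that similarity is scored with the Tanimoto coefficient, a common choice that the abstract does not itself specify. The function names, the threshold value, and the toy reference data are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# Assumption: a fingerprint is modelled as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either set."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fish_targets(
    query: Fingerprint,
    reference: Iterable[Tuple[Fingerprint, Set[str]]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (target, best similarity) pairs for reference ligands that are
    at least `threshold` similar to the query, most similar first."""
    best: Dict[str, float] = {}
    for ligand_fp, targets in reference:
        score = tanimoto(query, ligand_fp)
        if score >= threshold:
            for target in targets:
                best[target] = max(best.get(target, 0.0), score)
    # Sort by descending similarity, then by target name for stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy reference set: (ligand fingerprint, known targets).
REFERENCE = [
    (frozenset({1, 2, 3, 4}), {"TARGET_A"}),
    (frozenset({1, 2, 3, 9}), {"TARGET_A", "TARGET_B"}),
    (frozenset({7, 8}), {"TARGET_C"}),
]
QUERY = frozenset({1, 2, 3, 4, 5})

hits = fish_targets(QUERY, REFERENCE, threshold=0.5)
for target, score in hits:
    print(f"{target}\t{score:.2f}")
```

In this toy case the query shares enough bits with the first two reference ligands to inherit their targets, while the third ligand falls below the threshold. In practice the fingerprints would be computed by a cheminformatics toolkit and the reference set would be the annotated ChEMBL compounds.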
The compound and fingerprint data files were then prepared so as to be compatible with FastTargetPred requirements. The 714,761 ChEMBL compounds with computed fingerprints hit 6,477 macromolecular targets based on the selected criteria. For each of these compounds a ChEMBL target ID is reported, and these target IDs were matched with the corresponding UniProt IDs. Thus, when available, the following are provided: the UniProt ID, the UniProt protein name, the gene name, the organism, annotated involvement in diseases, gene ontology data, and cross-references to the Reactome pathway database. As short peptides can be of interest for drug discovery and chemical biology endeavours, we attempted to predict putative macromolecular targets for a previously reported exhaustive set of peptides built from four natural amino acids (i.e., 20 × 20 × 20 × 20 = 160,000 linear tetrapeptides) using FastTargetPred and the ChEMBL29 dataset generated here. With the parameters used, putative targets are reported for 63,944 unique query peptides. These target predictions are provided in two searchable files with hyperlinks to the ChEMBL, UniProt and Reactome databases.
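The size of the tetrapeptide library follows directly from the combinatorics and can be checked with a short sketch. It assumes that the "natural amino acids" are the 20 standard proteinogenic residues, written here with their one-letter codes; the variable and function names are illustrative only, and the enumeration order is not necessarily that of the published peptide list.

```python
from itertools import product
from typing import Iterator

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def linear_tetrapeptides() -> Iterator[str]:
    """Yield every ordered sequence of four residues (repeats allowed)."""
    for residues in product(AMINO_ACIDS, repeat=4):
        yield "".join(residues)


peptides = list(linear_tetrapeptides())
total = len(peptides)  # 20 ** 4 sequences

# Share of the library with at least one putative target, using the
# count of 63,944 query peptides reported in the abstract.
coverage = 63_944 / total

print(f"{total} tetrapeptides, {coverage:.1%} with putative targets")
```

Under these assumptions the enumeration reproduces the 160,000 sequences quoted above, and the 63,944 peptides with reported putative targets correspond to roughly 40% of the library.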

Keywords