Patterns (Jul 2020)
A Searchable Database of Crystallization Cocktails in the PDB: Analyzing the Chemical Condition Space
Abstract
Summary: Nearly 90% of structural models in the Protein Data Bank (PDB), the central resource worldwide for three-dimensional structural information, are currently derived from macromolecular crystallography (MX). A major bottleneck in determining MX structures is finding conditions in which a biomolecule will crystallize. Here, we present a searchable database of the chemicals associated with successful crystallization experiments from the PDB. We use these data to examine the relationship between protein secondary structure and average molecular weight of polyethylene glycol and to investigate patterns in crystallization conditions. Our analyses reveal striking patterns of both redundancy of chemical compositions in crystallization experiments and extreme sparsity of specific chemical combinations, underscoring the challenges faced in generating predictive models for de novo optimal crystallization experiments.

The Bigger Picture: Determining structures of biological macromolecules is critical to advancing drug discovery and medical research. The majority (∼90%) of structures in the Protein Data Bank (PDB) derive from X-ray crystallography. To obtain a crystal structure, the first thing you need is a crystal. A key bottleneck to crystallographic methods is finding conditions in which a sample will crystallize. In addition to three-dimensional structural files, the PDB contains abundant metadata on crystallization details. Mining these data could unlock the bottleneck and facilitate structure acquisition. Crucial metadata on crystallization conditions are in free text fields in the PDB; parsing these data on a large scale is challenging. We have developed a tool to facilitate extraction and standardization. We provide the extraction tool, a curated dataset, and analyses of these metadata. This study enables PDB data mining by providing a customizable tool capable of imposing a controlled vocabulary on free text PDB metadata.
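
To make the idea of imposing a controlled vocabulary on free text concrete, the following is a minimal Python sketch and not the tool described in this paper. It assumes a comma-separated crystallization details string of the kind deposited with PDB entries. The synonym table, the regular expressions, and the `standardize` function are all hypothetical names introduced here for illustration; a real vocabulary would be far larger and the free text far messier.

```python
import re

# Hypothetical synonym table (an assumption for illustration): maps
# free-text spellings to a single controlled-vocabulary name.
SYNONYMS = {
    "nacl": "sodium chloride",
    "sodium chloride": "sodium chloride",
    "peg 3350": "polyethylene glycol 3350",
    "peg3350": "polyethylene glycol 3350",
    "polyethylene glycol 3350": "polyethylene glycol 3350",
    "hepes": "HEPES",
}

# A component is assumed to look like "<amount> <unit> <chemical name>".
COMPONENT = re.compile(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mM|M|%)\s+(?P<name>.+)")
# An optional pH annotation such as "pH 7.5" inside a component.
PH = re.compile(r"\bpH\s*(?P<ph>\d+(?:\.\d+)?)")


def standardize(details):
    """Split a free-text condition string into standardized components.

    Returns one dict per comma-separated chunk. Chunks that cannot be
    parsed, or whose chemical is not in SYNONYMS, get name=None so they
    can be reviewed by hand rather than silently guessed.
    """
    components = []
    for raw in details.split(","):
        chunk = raw.strip()
        if not chunk:
            continue
        ph = None
        ph_match = PH.search(chunk)
        if ph_match:
            ph = float(ph_match.group("ph"))
            chunk = PH.sub("", chunk).strip()
        entry = {"raw": raw.strip(), "name": None, "amount": None, "unit": None, "ph": ph}
        match = COMPONENT.match(chunk)
        if match:
            # Normalize case and whitespace before the vocabulary lookup.
            key = " ".join(match.group("name").lower().split())
            entry["name"] = SYNONYMS.get(key)
            entry["amount"] = float(match.group("amount"))
            entry["unit"] = match.group("unit")
        components.append(entry)
    return components


if __name__ == "__main__":
    for component in standardize("0.1 M HEPES pH 7.5, 20% PEG 3350, 0.2 M NaCl"):
        print(component)
```

Returning `name=None` for anything outside the table, rather than passing the raw text through, is a deliberate choice in this sketch: it keeps unrecognized spellings visible so the vocabulary can be extended, which is in the spirit of a customizable tool.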