findMySequence: a neural-network-based approach for identification of unknown proteins in X-ray crystallography and cryo-EM
Grzegorz Chojnowski,
Adam J. Simpkin,
Diego A. Leonardo,
Wolfram Seifert-Davila,
Dan E. Vivas-Ruiz,
Ronan M. Keegan,
Daniel J. Rigden
Affiliations
Grzegorz Chojnowski
European Molecular Biology Laboratory, Hamburg Unit, Notkestrasse 85, 22607 Hamburg, Germany
Adam J. Simpkin
Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
Diego A. Leonardo
São Carlos Institute of Physics, University of São Paulo, Avenida João Dagnone 1100, São Carlos, SP 13563-120, Brazil
Wolfram Seifert-Davila
European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany
Dan E. Vivas-Ruiz
Laboratorio de Biología Molecular, Facultad de Ciencias Biológicas, Universidad Nacional Mayor de San Marcos, Avenida Venezuela Cdra 34 S/N, Ciudad Universitaria, Lima, Peru
Ronan M. Keegan
Rutherford Appleton Laboratory, Research Complex at Harwell, UKRI-STFC, Didcot OX11 0FA, United Kingdom
Daniel J. Rigden
Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
Although experimental protein-structure determination usually targets known proteins, chains of unknown sequence are often encountered. They may be purified from natural sources, appear as an unexpected fragment of a well-characterized protein, or be present as a contaminant. Whatever its origin, the unknown protein must be characterized. Here, an automated pipeline is presented for the identification of protein sequences from cryo-EM reconstructions and crystallographic data. Its application to the characterization of the crystal structure of an unknown protein purified from snake venom is described. It is also shown that the approach can be successfully applied to the identification of protein sequences and the validation of sequence assignments in cryo-EM protein structures.