Scientific Data (Nov 2024)
A multi-species benchmark for training and validating mass spectrometry proteomics machine learning models
Abstract
Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is based on a previously described benchmark but has been re-processed to ensure consistent data quality and enforce separation of training and test peptides.
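To illustrate what separating training and test peptides can mean in practice, the following is a minimal sketch of a peptide-level split, in which every spectrum of a given peptide sequence falls entirely into one partition. It is not the procedure used to build the dataset: the function name `split_by_peptide`, the `(spectrum_id, peptide)` tuple layout, the `test_fraction` and `seed` parameters, and the example sequences are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_by_peptide(psms, test_fraction=0.2, seed=0):
    """Split PSMs so that no peptide sequence appears in both partitions.

    `psms` is assumed to be a list of (spectrum_id, peptide) tuples;
    this layout is an illustrative assumption, not the dataset's format.
    """
    # Group spectra by peptide so that the peptide is the unit of splitting.
    by_peptide = defaultdict(list)
    for spectrum_id, peptide in psms:
        by_peptide[peptide].append(spectrum_id)

    # Shuffle the unique peptides reproducibly and hold out a fraction of them.
    peptides = sorted(by_peptide)
    random.Random(seed).shuffle(peptides)
    n_test = max(1, round(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])

    # Assign each PSM to the partition that owns its peptide.
    train = [(s, p) for s, p in psms if p not in test_peptides]
    test = [(s, p) for s, p in psms if p in test_peptides]
    return train, test


# Toy input with repeated peptides (made-up sequences).
example_psms = [
    ("s1", "PEPTIDEK"), ("s2", "PEPTIDEK"),
    ("s3", "ELVISLIVESK"),
    ("s4", "SAMPLER"), ("s5", "SAMPLER"),
    ("s6", "ANOTHERK"),
]
train_psms, test_psms = split_by_peptide(example_psms, test_fraction=0.25, seed=1)

# The two partitions should share no peptide sequence.
shared = {p for _, p in train_psms} & {p for _, p in test_psms}
print("shared peptides:", shared)
```

Splitting on the peptide rather than on the individual spectrum is the point of the sketch: a random split over spectra would typically place spectra of the same peptide on both sides, which lets a model score well on the test set by memorizing sequences it has already seen.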