On the design of linked datasets mapping networks of collaboration in the genomic sequencing of Saccharomyces cerevisiae, Homo sapiens, and Sus scrofa [version 3; peer review: 2 approved]

Rhodri Leng; Mark Wong

F1000Research (Feb 2023)

On the design of linked datasets mapping networks of collaboration in the genomic sequencing of Saccharomyces cerevisiae, Homo sapiens, and Sus scrofa [version 3; peer review: 2 approved]

Rhodri Leng,
Mark Wong

Affiliations

Rhodri Leng: Science, Technology and Innovation Studies, University of Edinburgh, Edinburgh, EH1 1LZ, UK
Mark Wong: ORCiD; Urban Studies, School of Social and Political Sciences, University of Glasgow, Glasgow, G12 8QQ, UK

Journal volume & issue: Vol. 8

Abstract

Read online

This data note describes a unique two-step methodology to construct six linked datasets covering the sequencing of Saccharomyces cerevisiae, Homo sapiens, and Sus scrofa genomes. The datasets were used as evidence in a project that investigated the history of genomic science. To design the datasets, we first retrieved all sequence submission data from the European Nucleotide Archive (ENA), including accession numbers associated with each of our three species. Second, we used these accession numbers to construct queries to retrieve peer-reviewed scientific publications that first described these sequence submissions in the scientific literature. For each species, this resulted in two associated datasets: 1) A .csv file documenting the PMID of each article describing new sequences, all paper authors, all institutional affiliations of each author, countries of institution, year of first submission to the ENA (when available), and the year of article publication, and 2) A .csv file documenting all institutions submitting to the ENA, number of nucleotides sequenced and years of submission to the database. We utilised these datasets to understand how institutional collaboration shaped sequencing efforts, and to systematically identify important institutions and changes in the structure of research communities throughout the history of genomics and across our three target species. This data note, therefore, should aid researchers who would like to use these data for future analyses by making the methodology that underpins it transparent. Further, by detailing our methodology, researchers may be able to utilise our approach to construct similar datasets in the future.

Published in F1000Research

ISSN: 2046-1402 (Online)
Publisher: F1000 Research Ltd
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://f1000research.com

About the journal

Abstract

Keywords