Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany; Jacobs University Bremen, Bremen, Germany
Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany; Department of Medicine, University of Chicago, Chicago, United States
Silvia G Acinas
Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
Albert Barberán
Department of Environmental Science, University of Arizona, Tucson, United States
Pier Luigi Buttigieg
Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany
Department of Medicine, University of Chicago, Chicago, United States; Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, United States
Robert D Finn
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
Renzo Kottmann
Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
Alex Mitchell
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Barcelona, Spain
Kimmo Siren
Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
Martin Steinegger
School of Biological Sciences, Seoul National University, Seoul, Republic of Korea; Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
Frank Oliver Gloeckner
Jacobs University Bremen, Bremen, Germany; University of Bremen and Life Sciences and Chemistry, Bremen, Germany; Computing Center, Helmholtz Center for Polar and Marine Research, Bremerhaven, Germany
Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany; Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches for incorporating the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS, and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction, and predominantly taxonomically restricted at the species level. From the 71 million genes identified as being of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource for expanding our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how this approach enables the generation of hypotheses that can be used to augment experimental data.