Wellcome Open Research (Jan 2023)

Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life [version 1; peer review: 2 approved]

  • Richard Challis,
  • Cibele Sotero-Caio,
  • Sujai Kumar,
  • Max Brown,
  • Mark Blaxter

Journal volume & issue
Vol. 8

Abstract

As genomic data transform our understanding of biodiversity, the Earth BioGenome Project (EBP) has set a goal of generating reference quality genome assemblies for all ~1.9 million described eukaryotic taxa. Meeting this goal requires coordination among many individual regional and taxon-focussed projects working under the EBP umbrella. Large-scale sequencing projects require ready access to validated genome-relevant metadata, such as genome sizes and karyotypes, but these data are dispersed across the literature, and directly measured values are lacking for most taxa. To meet these needs, we have developed Genomes on a Tree (GoaT), an Elasticsearch-powered datastore and search index for genome-relevant metadata and sequencing project plans and statuses. GoaT indexes publicly available metadata for all eukaryotic species and interpolates missing values through phylogenetic comparison. GoaT also holds target priority and sequencing status information for many projects affiliated to the EBP to aid project coordination. Metadata and status attributes in GoaT can be queried through a mature API, a web front end, and a command line interface. The web front end additionally provides summary visualisations for data exploration and reporting (see https://goat.genomehubs.org). GoaT currently holds direct or estimated values for over 70 taxon attributes and over 30 assembly attributes across 1.5 million eukaryotic species. The depth and breadth of curated data, frequent updates, and a versatile query interface make GoaT a powerful data aggregator and portal to explore and report underlying data for the eukaryotic tree of life. We illustrate this utility through a series of use cases from planning through to completion of a genome-sequencing project.
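To make the idea of filling in missing values through phylogenetic comparison concrete, the following is a minimal sketch, not GoaT's actual implementation. It assumes a toy taxonomy stored as a child-to-parent map and a simple rule: a taxon without a direct measurement takes the median of the direct values found under its nearest ancestor that has any. The taxon names, the genome sizes, and the helper functions `descendants_with_values` and `estimate` are all hypothetical and chosen for illustration only.

```python
from statistics import median

# Hypothetical toy taxonomy: child -> parent (names are illustrative only).
PARENT = {
    "species_a": "genus_x",
    "species_b": "genus_x",
    "species_c": "genus_x",
    "species_d": "genus_y",
    "genus_x": "family_z",
    "genus_y": "family_z",
    "family_z": None,
}

# Made-up directly measured genome sizes (Mb) for a subset of species.
DIRECT = {"species_a": 400.0, "species_b": 600.0}


def descendants_with_values(node, parent, direct):
    """Collect the direct values of every taxon at or below `node`."""
    values = []
    for taxon, value in direct.items():
        current = taxon
        # Walk up from the measured taxon; keep its value if we pass `node`.
        while current is not None:
            if current == node:
                values.append(value)
                break
            current = parent[current]
    return values


def estimate(taxon, parent, direct):
    """Return (value, source).

    `source` is "direct" for a measured value, the name of the ancestor
    whose descendants supplied the estimate, or None if nothing was found.
    """
    if taxon in direct:
        return direct[taxon], "direct"
    ancestor = parent[taxon]
    while ancestor is not None:
        values = descendants_with_values(ancestor, parent, direct)
        if values:
            # Assumed summary rule: median of the available direct values.
            return median(values), ancestor
        ancestor = parent[ancestor]
    return None, None


if __name__ == "__main__":
    for name in ("species_a", "species_c", "species_d"):
        print(name, estimate(name, PARENT, DIRECT))
```

In this sketch the returned source records whether a value was measured directly or inferred from a given ancestor, so an estimate is never mistaken for a measurement. A real system would likely use different summary statistics for different attribute types.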

Keywords