PeerJ (Aug 2025)

Use ATCCfinder to identify commercially available American Type Culture Collection strains based on sequence queries

  • Samuel I. Koehler,
  • Earl A. Middlebrook,
  • Blake T. Hovde,
  • Erik R. Hanschen

DOI
https://doi.org/10.7717/peerj.19832
Journal volume & issue
Vol. 13
p. e19832

Abstract


Microbiology research was conducted for decades before sequencing resources and large culture collection sequence repositories became widely available, making it challenging to efficiently identify and validate strains used in historical studies. Similarly, it is difficult to find commercially available microbial strains that resemble strains of interest or that contain target genes identified in metagenomic experiments. Despite tremendous advances in sequencing data availability, database curation, and sequence-searching software, identifying commercially available microbial strains from sequence data remains complicated and tedious. The American Type Culture Collection (ATCC) sells a wide variety of microbes and uniquely provides strain-level taxonomic classification and associated sequenced reference genomes for over four thousand isolates, with more being added regularly. As researchers purchase and sequence isolates from ATCC, many sequences derived from ATCC isolates are deposited in public databases such as NCBI-Genome. Sequences uploaded to public databases vary in laboratory, bioinformatics, and metadata quality, and can also contain mutations acquired during cultivation that are not representative of ATCC stocks. Using ATCC-sourced reference genomes ensures that consistent quality standards and analysis methodologies are applied, so that strain sequences are accurately represented. Currently, ATCC does not provide a way to search for sequence similarity between many query sequences and ATCC genomes. While NCBI-BLAST could be used to search queries against GenBank, with results filtered for “ATCC” tags, the quality of search results varies and sorting them is time-consuming. Here we present the software ATCCfinder (GitHub: https://github.com/lanl/ATCCfinder, Zenodo: https://doi.org/10.5281/zenodo.15178103), which uses the ATCC application programming interface (API) to generate queryable databases from ATCC genome resources.
The algorithm generates databases of the four ATCC data types: strain-specific genome assembly sequence data (sequence), information about how each strain was collected (metadata, catalogue), and structural and functional information about genome assemblies (annotation). Once ATCCfinder has retrieved the ATCC sequences, nucleotide queries are compared against ATCC reference genomes using the sequence alignment tool minimap2, and the results are parsed and analyzed to produce summaries of homologous sequence matches in strains available from ATCC. ATCCfinder identifies and downloads new ATCC references, allowing users to keep the target search database up to date. ATCCfinder efficiently accesses, queries, and summarizes ATCC resources, identifying purchasable strains homologous to historical sequences, functional genes, operons, and other genetic components.
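
The parsing step described above can be illustrated with a minimal Python sketch. It is not ATCCfinder's own code: it only assumes that minimap2 alignments are available in the standard tab-separated PAF format, and the function names, the thresholds (`min_identity`, `min_coverage`), and the identity-times-coverage ranking are hypothetical choices made for illustration.

```python
def parse_paf_line(line):
    """Read the 12 mandatory PAF columns that this sketch needs."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0],
        "qlen": int(f[1]),
        "qstart": int(f[2]),
        "qend": int(f[3]),
        "strand": f[4],
        "target": f[5],
        "matches": int(f[9]),   # number of matching residues
        "aln_len": int(f[10]),  # alignment block length
        "mapq": int(f[11]),
    }


def summarize_hits(paf_lines, min_identity=0.9, min_coverage=0.5):
    """Keep the best-scoring hit per query (hypothetical thresholds).

    Identity is approximated as matches / alignment block length;
    coverage is the aligned fraction of the query.
    """
    best = {}
    for line in paf_lines:
        if not line.strip():
            continue
        hit = parse_paf_line(line)
        identity = hit["matches"] / hit["aln_len"]
        coverage = (hit["qend"] - hit["qstart"]) / hit["qlen"]
        if identity < min_identity or coverage < min_coverage:
            continue
        score = identity * coverage  # illustrative ranking, an assumption
        current = best.get(hit["query"])
        if current is None or score > current["score"]:
            best[hit["query"]] = {
                "target": hit["target"],
                "identity": identity,
                "coverage": coverage,
                "score": score,
            }
    return best


# Made-up PAF records; the query and target names are placeholders.
example_paf = [
    "geneA\t1000\t0\t1000\t+\tATCC_X_contig1\t5000000\t100\t1100\t990\t1000\t60",
    "geneA\t1000\t0\t600\t+\tATCC_Y_contig3\t4000000\t50\t650\t540\t600\t30",
    "geneB\t800\t0\t200\t-\tATCC_X_contig2\t5000000\t10\t210\t198\t200\t20",
]

for query, hit in summarize_hits(example_paf).items():
    print(query, hit["target"], round(hit["identity"], 3), round(hit["coverage"], 3))
```

In this sketch a query with no alignment passing both thresholds is simply absent from the summary. A real workflow would also need to map each target contig back to its ATCC strain and catalogue entry, which is not shown here.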

Keywords