Scientific Data (Nov 2023)

A globally synthesised and flagged bee occurrence dataset and cleaning workflow

  • James B. Dorey,
  • Erica E. Fischer,
  • Paige R. Chesshire,
  • Angela Nava-Bolaños,
  • Robert L. O’Reilly,
  • Silas Bossert,
  • Shannon M. Collins,
  • Elinor M. Lichtenberg,
  • Erika M. Tucker,
  • Allan Smith-Pardo,
  • Armando Falcon-Brindis,
  • Diego A. Guevara,
  • Bruno Ribeiro,
  • Diego de Pedro,
  • John Pickering,
  • Keng-Lou James Hung,
  • Katherine A. Parys,
  • Lindsie M. McCabe,
  • Matthew S. Rogan,
  • Robert L. Minckley,
  • Santiago J. E. Velazco,
  • Terry Griswold,
  • Tracy A. Zarrillo,
  • Walter Jetz,
  • Yanina V. Sica,
  • Michael C. Orr,
  • Laura Melissa Guzman,
  • John S. Ascher,
  • Alice C. Hughes,
  • Neil S. Cobb

DOI
https://doi.org/10.1038/s41597-023-02626-w
Journal volume & issue
Vol. 10, no. 1
pp. 1 – 17

Abstract

Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represent a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R-workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package, with its online documentation, lets end users modify filtering parameters to suit their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.
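
The distinction between the “flagged-but-uncleaned” and “cleaned” formats can be illustrated with a minimal sketch. This is not the BeeBDC API (BeeBDC is an R package, and none of its functions are reproduced here); the `Record` class, the flag names `has_date` and `not_duplicate`, and the example records are all hypothetical and chosen only to show the general flag-then-filter idea, in which flags are stored on every record and the user decides later which ones to enforce.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical occurrence record; not the BeeBDC data model."""
    species: str
    country: str
    date: str  # ISO 8601 date, or "" when missing
    flags: dict = field(default_factory=dict)


def flag_records(records):
    """Attach boolean flags to each record; True means the check passed."""
    seen = set()
    for rec in records:
        key = (rec.species, rec.country, rec.date)
        rec.flags["has_date"] = bool(rec.date)
        # The first occurrence of a key passes; later copies are flagged.
        rec.flags["not_duplicate"] = key not in seen
        seen.add(key)
    return records


def clean_records(records, required_flags):
    """Keep only the records that pass every flag the user chose to enforce."""
    return [r for r in records if all(r.flags[f] for f in required_flags)]


# Made-up example records (assumption, for illustration only).
records = [
    Record("Apis mellifera", "Australia", "2019-03-02"),
    Record("Apis mellifera", "Australia", "2019-03-02"),  # exact duplicate
    Record("Bombus terrestris", "France", ""),            # missing date
]

flagged = flag_records(records)                                  # nothing removed
cleaned = clean_records(flagged, ["has_date", "not_duplicate"])  # strict filter
lenient = clean_records(flagged, ["not_duplicate"])              # tolerate missing dates

print(len(flagged), len(cleaned), len(lenient))
```

Keeping the flags alongside the data, instead of deleting records up front, is what would let a user with a different research question (for example, one that does not need collection dates) apply a looser filter to the same flagged dataset.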