Ecological Informatics (Dec 2024)

The retrospective double-entry of a long-term ecological dataset

  • Simon Bull,
  • Robert Sharrad,
  • Michael G. Gardner

Journal volume & issue
Vol. 84
p. 102873

Abstract

Research data are almost always assumed to be reliable, but there are many reasons why data can be unreliable. Manual data-entry error rates are typically observed in the 1 to 4 % range and can be statistically impactful. This has encouraged techniques to mitigate the risk of transcription error, among which the double-entry method remains the most effective. Unfortunately, these techniques are rarely applied retrospectively to datasets collected years or decades ago, including to highly valued long-term ecological datasets that continue to contribute to active research.

This study defines an approach for the retrospective double-entry of long-term ecological datasets and then applies it to one such dataset: the 34-year (and counting) Mt Mary Lizard Survey. Software was used to execute comparisons of c.760,000 individual data value pairs across c.56,000 records to corroborate matching values and identify unmatched values.

The key findings are: a) from 760,967 value pair comparisons between the originally keyed dataset and a retrospectively re-keyed version of the same dataset, 18,637 differences (2.5 %) were detected, b) almost half (48 %) of the differences detected were intentional alterations made to the original dataset during data curation efforts, c) data differences were not uniformly distributed across data fields but concentrated in the animal identity data field, and d) a three-way comparison of the identity field corroborated a recorded value in almost all cases.

Landmark, long-term ecological studies continue to be the evidentiary framework for ecological science. However, data quality metrics, including how faithfully digital transcriptions represent the originally recorded values, are rarely reported. Given that manual transcription errors are virtually assured, and that post hoc, intentional alterations made during data curation are a realistic possibility, one could legitimately ask whether a manually transcribed and curated dataset is a genuine representation of the originally recorded values. The retrospective double-entry approach is one way to find out.
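
The value-pair comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the software used in the study: the function names, the field names (`animal_id`, `mass_g`), and the toy records are all hypothetical, and the sketch assumes both keyed versions share record identifiers and are compared as exact strings.

```python
from collections import Counter


def compare_entries(original, rekeyed, fields):
    """Compare two keyed versions of the same records, field by field.

    Both inputs map a record id to a dict of field values (hypothetical layout).
    Returns the number of value pairs compared and a list of differences.
    """
    differences = []
    compared = 0
    # Only records present in both versions can be compared pairwise.
    for record_id in sorted(original.keys() & rekeyed.keys()):
        for field in fields:
            first = original[record_id].get(field)
            second = rekeyed[record_id].get(field)
            compared += 1
            if first != second:
                differences.append((record_id, field, first, second))
    return compared, differences


def differences_by_field(differences):
    """Count differences per field to show where they concentrate."""
    return Counter(field for _, field, _, _ in differences)


# Hypothetical toy data, not taken from the survey.
original = {
    1: {"animal_id": "A102", "mass_g": "14.2"},
    2: {"animal_id": "B077", "mass_g": "9.8"},
    3: {"animal_id": "C310", "mass_g": "11.0"},
}
rekeyed = {
    1: {"animal_id": "A102", "mass_g": "14.2"},
    2: {"animal_id": "B071", "mass_g": "9.8"},
    3: {"animal_id": "C301", "mass_g": "11.0"},
}

compared, diffs = compare_entries(original, rekeyed, ["animal_id", "mass_g"])
difference_rate = len(diffs) / compared
per_field = differences_by_field(diffs)
```

A mismatch flagged this way says only that the two versions disagree. Deciding whether it reflects a keying error or an intentional curation change still requires checking against the originally recorded value.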

Keywords