Environmental Research: Ecology (Jan 2025)
Impacts of benchmarking choices on inferred model skill of the Arctic–Boreal terrestrial carbon cycle
Abstract
Land surface models require continuous validation against observations to improve and reduce simulation uncertainty. However, inferred model performance can be heavily influenced by subjective choices made in the selection and application of observational data products. A key region often misrepresented by models is the Arctic–Boreal zone, a potential tipping point in Earth’s climate system because its large permafrost carbon stocks are vulnerable to release under climate warming. We use the International Land Model Benchmarking (ILAMB) framework to evaluate how the skill of TRENDY-v9 models varies with the choice of observation-based benchmark and with how benchmarks are applied in model evaluation. This analysis uses global datasets integrated into ILAMB together with new, regionally specific observational products from the Arctic–Boreal Vulnerability Experiment. Our results cover the period 1979–2019 and show that model scores can vary substantially depending on the data product applied, with higher scores indicating better model performance against observations. The lowest model scores occur when models are benchmarked against regional rather than global datasets. We also evaluate observed and modeled functional relationships between ecosystem respiration and air temperature, and between gross primary production and precipitation. Here, we find that the magnitude and shape of these responses are strongly affected by the choice of observational dataset and by the approach used to construct the functional-relationship benchmark. These results suggest that model evaluation studies could convey a false sense of model skill if they rely on a single benchmark data product or fail to apply regional data products when performing a regional analysis. Collectively, our findings highlight the influence of benchmarking choices on model evaluation and point to the need for benchmarking guidelines when assessing model skill.
Keywords