Remote Sensing (Apr 2020)

Improved Inference and Prediction for Imbalanced Binary Big Data Using Case-Control Sampling: A Case Study on Deforestation in the Amazon Region

  • Denis Valle,
  • Jacy Hyde,
  • Matthew Marsik,
  • Stephen Perz

DOI
https://doi.org/10.3390/rs12081268
Journal volume & issue
Vol. 12, no. 8
p. 1268

Abstract

Fitting models to big data is computationally challenging. For example, satellite imagery often contains billions to trillions of pixels, so a pixel-level analysis that uses all the data to identify drivers of land-use change and generate predictions is not feasible. A common strategy to reduce sample size is to draw a random sample, but this approach is not ideal when the outcome of interest is rare in the landscape because it yields very few pixels with this outcome. Here we show that a case-control (CC) sampling approach, in which all (or a large fraction of) pixels with the outcome of interest and a subset of the pixels without this outcome are selected, can yield much better inference and prediction than random sampling (RS) if the estimated parameters and probabilities are adjusted with the equations that we provide. More specifically, we show that a CC approach can yield unbiased inference with much less uncertainty when CC data are analyzed with logistic regression models and their semiparametric variants (e.g., generalized additive models). We also show that a random forest model, when fitted to CC data, can generate much better predictions than when fitted to RS data. We illustrate this improved performance of the CC approach, when used together with the proposed bias-correction adjustments, with extensive simulations and a case study of deforestation in the Amazon region.
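
The abstract refers to adjustment equations without reproducing them. As a rough illustration only, the sketch below uses the standard textbook correction for logistic regression fitted to case-control data, which may differ in form from the equations in the paper. It assumes the sampling fractions of cases and controls are known. Under that assumption the slope estimates need no change, and the intercept and predicted probabilities are shifted by the log ratio of the two sampling fractions. The function names and the example numbers are hypothetical and do not come from the article.

```python
import math


def case_control_offset(frac_cases, frac_controls):
    """Log ratio of the sampling fractions for cases and controls.

    frac_cases: fraction of all outcome pixels kept in the CC sample.
    frac_controls: fraction of all non-outcome pixels kept in the CC sample.
    Both are assumed known and strictly between 0 (exclusive) and 1 (inclusive).
    """
    return math.log(frac_cases / frac_controls)


def adjust_intercept(intercept_cc, frac_cases, frac_controls):
    """Shift a logistic intercept estimated on CC data to the population scale."""
    return intercept_cc - case_control_offset(frac_cases, frac_controls)


def adjust_probability(p_cc, frac_cases, frac_controls):
    """Convert a probability predicted on the CC scale to the population scale."""
    logit_cc = math.log(p_cc / (1.0 - p_cc))
    logit_pop = logit_cc - case_control_offset(frac_cases, frac_controls)
    return 1.0 / (1.0 + math.exp(-logit_pop))


if __name__ == "__main__":
    # Hypothetical design: keep every outcome pixel and 1% of the other pixels.
    print(adjust_probability(0.5, frac_cases=1.0, frac_controls=0.01))
```

With all cases and 1% of controls retained, a fitted probability of 0.5 on the CC scale corresponds to roughly 0.0099 on the population scale, which shows why unadjusted CC predictions would greatly overstate how common a rare outcome is.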

Keywords