GIScience & Remote Sensing (Dec 2022)

Iterative self-organizing SCEne-LEvel sampling (ISOSCELES) for large-scale building extraction

  • Benjamin Swan,
  • Melanie Laverdiere,
  • H. Lexie Yang,
  • Amy Rose

DOI
https://doi.org/10.1080/15481603.2021.2006433
Journal volume & issue
Vol. 59, no. 1
pp. 1–16

Abstract

Convolutional neural networks (CNNs) provide state-of-the-art performance in many computer vision tasks, including those related to remote-sensing image analysis. Successfully training a CNN to generalize well to unseen data, however, requires training on samples that represent the full distribution of variation in both the target classes and their surrounding contexts. With remote sensing data, acquiring a sufficiently representative training set is a challenge due to both the inherent multi-modal variability of satellite or aerial imagery and the generally high cost of labeling data. To address this challenge, we have developed ISOSCELES, an Iterative Self-Organizing SCEne-LEvel Sampling method for hierarchical sampling of large image sets. Using affinity propagation, ISOSCELES automates the selection of highly representative training images. Compared to random sampling or using available reference data, the distribution of the training set is principally data-driven, reducing the chance of oversampling uninformative areas or undersampling informative ones. In comparison to manual sample selection by an analyst, ISOSCELES exploits descriptive features, spectral and/or textural, and eliminates human bias in sample selection. Using a hierarchical sampling approach, ISOSCELES can obtain a training set that reflects both between-scene variability, such as in viewing angle and time of day, and within-scene variability at the level of individual training samples. We verify the method by demonstrating its superiority to stratified random sampling in the challenging task of adapting a pre-trained model to a new image and spatial domain for country-scale building extraction. Using a pair of hand-labeled training sets comprising 1,987 sample image chips, a total of 496,000,000 individually labeled pixels, we show, across three distinct model architectures, an increase in accuracy, as measured by F1-score, of 2.2–4.2%.
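
To make the hierarchical sampling idea concrete, the following is a minimal Python sketch, not the authors' implementation, of two-stage exemplar selection with affinity propagation as available in scikit-learn: first over per-scene descriptors (between-scene variability), then over per-chip descriptors within each selected scene (within-scene variability). All names (select_exemplars, scene_features, chip_features) and the random feature data are illustrative assumptions.

    # Two-stage exemplar selection sketch with affinity propagation.
    # Assumes each scene (and each chip) is summarized by a feature vector
    # of spectral/textural statistics; here we use random data as a stand-in.
    import numpy as np
    from sklearn.cluster import AffinityPropagation

    def select_exemplars(features: np.ndarray, damping: float = 0.9) -> np.ndarray:
        """Return row indices of exemplars chosen by affinity propagation.

        Affinity propagation determines the number of exemplars from the
        data itself; damping stabilizes its message-passing updates.
        """
        ap = AffinityPropagation(damping=damping, max_iter=500, random_state=0)
        ap.fit(features)
        return ap.cluster_centers_indices_

    rng = np.random.default_rng(0)

    # Stage 1: pick representative scenes from per-scene descriptors,
    # capturing between-scene variability (e.g., viewing angle, time of day).
    scene_features = rng.normal(size=(200, 32))    # 200 scenes, 32 descriptors each
    scene_exemplars = select_exemplars(scene_features)

    # Stage 2: within each exemplar scene, pick representative chips from
    # per-chip descriptors, capturing within-scene variability.
    training_chips = {}
    for s in scene_exemplars:
        chip_features = rng.normal(size=(1000, 32))  # 1,000 candidate chips per scene
        training_chips[s] = select_exemplars(chip_features)

    print(f"{len(scene_exemplars)} exemplar scenes; "
          f"{sum(len(v) for v in training_chips.values())} exemplar chips selected")

In practice the chip descriptors for each scene would be computed from the imagery itself rather than sampled at random, and the selected chips would then be labeled and used for fine-tuning, as in the domain-adaptation experiment described above.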

Keywords