Applied Sciences (Jul 2024)

Optimizing Mobile Vision Transformers for Land Cover Classification

  • Papia F. Rozario,
  • Ravi Gadgil,
  • Junsu Lee,
  • Rahul Gomes,
  • Paige Keller,
  • Yiheng Liu,
  • Gabriel Sipos,
  • Grace McDonnell,
  • Westin Impola,
  • Joseph Rudolph

DOI: https://doi.org/10.3390/app14135920
Journal volume & issue: Vol. 14, no. 13, p. 5920

Abstract

Image classification of remote sensing and geographic information system (GIS) data containing various land cover classes is essential for efficient and sustainable land use estimation, as well as for related tasks such as object detection, localization, and segmentation. Deep learning (DL) techniques have shown tremendous potential in the GIS domain. While convolutional neural networks (CNNs) have dominated image analysis, transformers have proven to be a unifying solution for several AI-based processing pipelines. Vision transformers (ViTs) can achieve comparable and, in some cases, better accuracy than CNNs. However, they suffer from a significant drawback: an excessive number of trainable parameters. Using trainable parameters judiciously can have multiple advantages, ranging from improved model scalability to better explainability, and can be decisive for model deployment on edge devices with limited resources, such as drones. In this research, we explore how the inherent structure of vision transformers behaves under custom modifications, training from scratch without pre-trained weights. To verify the proposed approach, these architectures are trained on multiple land cover datasets. Experiments reveal that a combination of lightweight convolutional layers, including ShuffleNet, with depthwise separable convolutions and average pooling can reduce the trainable parameters by 17.85% while still achieving higher accuracy than the base mobile vision transformer (MViT). We also observe that combining convolution layers with multi-headed self-attention layers in the MViT variants captures local and global features better than the standalone ViT architecture, which uses almost 95% more parameters than the proposed MViT variant.
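
To make the parameter savings concrete, the following minimal PyTorch sketch (our illustration, not the authors' released code; the channel sizes are arbitrary assumptions) compares the trainable-parameter count of a standard 3x3 convolution against a depthwise separable convolution of the kind used in the MViT variants, with average pooling included to show it contributes no parameters at all.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # Depthwise conv filters each channel separately (groups=in_ch),
    # then a 1x1 pointwise conv mixes information across channels.
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(64, 128)
pool = nn.AvgPool2d(2)  # average pooling downsamples with zero parameters

print(count_params(standard))   # 64*128*3*3 + 128 bias = 73,856
print(count_params(separable))  # (64*3*3 + 64) + (64*128 + 128) = 8,960
print(count_params(pool))       # 0

Compounded across the many convolutional stages of a backbone, a per-layer saving of this magnitude is the kind of reduction behind the roughly 17.85% cut in trainable parameters reported above.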

Keywords