IEEE Access (Jan 2024)

Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers

  • Xiao Yu,
  • Yunfeng Hua,
  • Siyun Zhang,
  • Zhaocheng Xu

DOI
https://doi.org/10.1109/ACCESS.2024.3507272
Journal volume & issue
Vol. 12
pp. 178971 – 178981

Abstract

Achieving effective and efficient segmentation of human body regions in distorted images is of practical significance. Current methods rely on transformers to extract discriminative features; however, owing to their global attention mechanism, existing transformers miss fine-grained image details and incur high computational costs, resulting in subpar segmentation accuracy and slow inference. In this paper, we introduce the Human Spatial Prior Module (HSPM) and the Dynamic Token Pruning Module (DTPM). The HSPM is specifically designed to capture human features in distorted images, using dynamic methods to extract highly variable details. The DTPM accelerates inference by pruning unimportant tokens at each layer of the Vision Transformer (ViT). Unlike traditional pruning approaches, the pruned tokens are preserved in feature maps and selectively reactivated in subsequent network layers to improve model performance. To validate the effectiveness of the Vision Transformer for Distorted Images (ViT-DI), we extend the ADE20K dataset and conduct experiments on the constructed dataset and the Cityscapes dataset. Our method achieves an mIoU increase of 1.6 and an FPS increase of 4.4 on the extended ADE20K dataset, and an mIoU increase of 0.77 and an FPS increase of 2.9 on the Cityscapes dataset, while reducing computational cost by approximately 130 GFLOPs. Our dataset is available at: https://github.com/GitHubYuxiao/ViT-DI.
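The abstract's core pruning idea (drop low-importance tokens per layer, but preserve them so later layers can selectively reactivate them) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the importance scores, the fixed `keep_ratio`, and the function names here are illustrative assumptions.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    # tokens: (N, D) token embeddings for one image; scores: (N,) importance
    # per token (hypothetical scoring, e.g. derived from attention weights).
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]   # most important tokens first
    keep_idx = np.sort(order[:k])      # keep original sequence order
    drop_idx = np.sort(order[k:])
    # Pruned tokens are returned (not discarded) so they can be preserved.
    return tokens[keep_idx], keep_idx, tokens[drop_idx], drop_idx

def reactivate_tokens(kept, keep_idx, preserved, drop_idx, n, d):
    # Rebuild the full (N, D) sequence by writing both the processed kept
    # tokens and the preserved pruned tokens back to their original slots,
    # so a later layer can attend over the reactivated tokens again.
    full = np.zeros((n, d), dtype=kept.dtype)
    full[keep_idx] = kept
    full[drop_idx] = preserved
    return full
```

Only the kept tokens would pass through the expensive attention layers, which is where the FPS gain comes from; preserving the pruned tokens avoids the permanent information loss of hard pruning.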

Keywords