IEEE Access (Jan 2025)
EMSFormer: Efficient Multi-Scale Transformer for Real-Time Semantic Segmentation
Abstract
Transformer-based models have achieved impressive performance in semantic segmentation in recent years. However, the multi-head self-attention mechanism in Transformers incurs significant computational overhead, and its high complexity and latency make it impractical for real-time applications. Numerous attention variants have been proposed to address this issue, yet their overall performance and inference speed remain limited. In this paper, we propose an efficient multi-scale Transformer (EMSFormer) for real-time semantic segmentation that employs learnable keys and values in a single-head attention mechanism together with a dual-resolution structure. Specifically, we propose multi-scale single-head attention (MS-SHA) to effectively learn multi-scale attention and improve feature representation capability. In addition, we introduce cross-resolution single-head attention (CR-SHA) to efficiently fuse the context-rich global features from the low-resolution branch into the features of the high-resolution branch. Experimental results show that our proposed method achieves state-of-the-art performance with real-time inference speed on the ADE20K, Cityscapes, and CamVid datasets.
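To make the efficiency argument concrete, the following is a minimal NumPy sketch of one plausible reading of single-head attention with learnable keys and values: instead of projecting keys and values from the n input tokens (which yields O(n²) attention), K and V are parameter matrices with a fixed, small number of slots m, so the cost becomes linear in n. All shapes and names here are illustrative assumptions, not the paper's actual MS-SHA/CR-SHA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention_learnable_kv(x, K, V):
    """Single-head attention where keys K and values V are learnable
    parameters of shape [m, d], not projections of the input x [n, d].
    Cost is O(n*m*d) rather than the O(n^2*d) of self-attention."""
    attn = softmax(x @ K.T / np.sqrt(K.shape[1]))  # [n, m] attention map
    return attn @ V                                # [n, d] output tokens

# Hypothetical sizes: n tokens, m learnable slots, d channels.
rng = np.random.default_rng(0)
n, m, d = 8, 4, 16
x = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))  # learnable keys (randomly initialized here)
V = rng.standard_normal((m, d))  # learnable values
out = single_head_attention_learnable_kv(x, K, V)
print(out.shape)  # (8, 16)
```

Because m is a fixed hyperparameter independent of the input resolution, this style of attention scales linearly with the number of tokens, which is what makes it attractive for real-time segmentation at high resolutions.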
Keywords