IEEE Access (Jan 2023)
Discrete Wavelet Transform Meets Transformer: Unleashing the Full Potential of the Transformer for Visual Recognition
Abstract
Traditionally, the success of the Transformer has been attributed to its token mixer, particularly the self-attention mechanism. However, recent studies suggest that replacing this attention-based token mixer with alternative techniques yields comparable results on various vision tasks, indicating that strong performance stems from the model's overall architectural structure rather than exclusively from the specific choice of token mixer. Building on this insight, we introduce the Discrete Wavelet TransFormer, a framework that incorporates the Discrete Wavelet Transform (DWT) into every building block of the Transformer. By exploiting distinct attributes of the DWT, the Discrete Wavelet TransFormer not only strengthens the network's ability to learn intricate feature representations across different levels of abstraction, but also enables lossless down-sampling, yielding a more resilient and efficient network. A comprehensive evaluation on diverse vision tasks demonstrates that the Discrete Wavelet TransFormer outperforms state-of-the-art Transformer-based models on all tasks by a significant margin.
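The lossless down-sampling property referred to above follows from the invertibility of the DWT: a one-level 2-D transform halves each spatial dimension while producing four sub-bands that together retain all of the input's information. As a minimal illustrative sketch (not the paper's implementation), a Haar DWT and its exact inverse can be written as:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT: split an (H, W) array (H, W even) into
    four (H/2, W/2) sub-bands LL, LH, HL, HH."""
    a = x[0::2, 0::2]  # even rows, even cols
    b = x[0::2, 1::2]  # even rows, odd cols
    c = x[1::2, 0::2]  # odd rows, even cols
    d = x[1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2  # low-pass approximation
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse one-level Haar DWT: reconstructs the original array
    exactly (up to floating-point rounding)."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
ll, lh, hl, hh = haar_dwt2(img)   # four 4x4 sub-bands
rec = haar_idwt2(ll, lh, hl, hh)  # exact reconstruction
print(np.allclose(rec, img))      # True
```

Unlike strided convolution or pooling, which discard information when reducing resolution, this decomposition is invertible: the spatial dimensions halve while the sub-band count grows fourfold, so nothing is lost.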
Keywords