IEEE Access (Jan 2024)

DCT-ViT: High-Frequency Pruned Vision Transformer With Discrete Cosine Transform

  • Jongho Lee
  • Hyun Kim

DOI
https://doi.org/10.1109/ACCESS.2024.3410231
Journal volume & issue
Vol. 12
pp. 80386 – 80396

Abstract


Transformers have demonstrated notable efficacy in computer vision, extending beyond their initial success in natural language processing. However, the application of vision transformers (ViTs) to resource-constrained mobile and edge devices is hampered by their extensive computational demands and large parameter counts. To address this, research has explored pruning redundant components of ViTs. Because the computational burden of ViTs scales quadratically with token count, previous efforts have aimed to decrease the number of tokens or to linearize the computational cost of self-attention. Such methods, however, often incur significant accuracy losses by disrupting critical information pathways within the ViT. Notably, ViTs focus primarily on shape rather than texture, potentially aligning their image interpretation more closely with human perception than that of convolutional neural network (CNN) models. This observation parallels the effectiveness of JPEG, the predominant image compression standard, which achieves high compression with minimal perceived quality degradation by discarding high-frequency details that have little impact on human object recognition. In this work, we harness the discrete cosine transform (DCT), an integral component of JPEG, to improve ViT efficiency. By selectively eliminating high-frequency tokens via the DCT, we considerably reduce computational demands while maintaining model accuracy. For instance, on ImageNet our DCT-enhanced ViT achieves a 25% reduction in computational cost relative to DeiT-Small with a 0.18% accuracy increase, and only a 0.72% accuracy decrease at a 44% computational reduction. Compared to DeiT-Tiny, our approach improves accuracy by 0.17% despite a 47% decrease in computational cost. Furthermore, the proposed DCT-ViT requires significantly fewer parameters than existing approaches, offering a more efficient alternative for deploying ViTs on edge devices.
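
To make the abstract's core idea concrete, the sketch below illustrates one plausible form of DCT-based high-frequency token pruning. It is a minimal illustration under stated assumptions, not the authors' implementation: the square token-grid reshape, the u + v frequency ordering, and the keep_ratio parameter are hypothetical choices introduced here for exposition.

    # Hypothetical sketch of DCT-based high-frequency token pruning.
    # Not the paper's implementation: the grid reshape, the frequency
    # ordering, and keep_ratio below are illustrative assumptions.
    import numpy as np
    from scipy.fft import dctn

    def prune_high_freq_tokens(tokens, keep_ratio=0.75):
        """Keep only the lowest-frequency DCT 'tokens' of a square patch grid.

        tokens: (N, C) array of N = H*W patch tokens (class token excluded).
        Returns an (M, C) array with M = round(keep_ratio * N).
        """
        n, c = tokens.shape
        h = w = int(np.sqrt(n))
        assert h * w == n, "expects a square token grid"

        grid = tokens.reshape(h, w, c)
        # 2D type-II DCT over the spatial axes, applied per channel.
        coeffs = dctn(grid, axes=(0, 1), norm="ortho")

        # Rank spatial frequencies by u + v (a simple zig-zag proxy):
        # low (u, v) carries coarse shape, high (u, v) carries fine texture.
        u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        order = np.argsort((u + v).ravel(), kind="stable")
        m = int(round(keep_ratio * n))

        # Discard the highest-frequency coefficients and return the rest
        # as the reduced token set for subsequent layers.
        return coeffs.reshape(n, c)[order[:m]]

    # Example: a 14x14 grid of 384-channel tokens, as in DeiT-Small.
    tokens = np.random.randn(196, 384).astype(np.float32)
    print(prune_high_freq_tokens(tokens, keep_ratio=0.56).shape)  # (110, 384)

Because self-attention cost grows quadratically with token count, even a moderate keep_ratio yields a large reduction in attention FLOPs, which is consistent with the computational savings the abstract reports.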

Keywords