IET Computer Vision (Jun 2023)

Loop and distillation: Attention weights fusion transformer for fine‐grained representation

  • Sun Fayou,
  • Hea Choon Ngo,
  • Zuqiang Meng,
  • Yong Wee Sek

DOI
https://doi.org/10.1049/cvi2.12181
Journal volume & issue
Vol. 17, no. 4
pp. 473 – 482

Abstract

Learning subtle discriminative feature representations plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field thanks to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture critical feature regions for FGVC because it attends only to the classification token and processes the image in a single pass. Moreover, ViT does not exploit the advantage of fusing attention weights. To better capture vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the most discriminative regions with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified to form the input of the next scale, so that the most discriminative region is finally located. Furthermore, a distillation learning method is employed to provide better supervision and improve generalisation. RDTrans can be easily trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017.
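The following is a minimal sketch, not the authors' implementation, of the "locate, crop, amplify" loop the abstract describes: fusing the ViT attention weights into a CLS-to-patch relevance map, then cropping the most attended region and resizing it as the input of the next scale. The function names, the attention-rollout-style fusion rule, and the fixed crop ratio are assumptions for illustration only; the paper's actual fusion and region-proposal details may differ.

```python
# Illustrative sketch (PyTorch); assumes a ViT backbone that exposes
# per-layer attention maps with the classification token at index 0.
import torch
import torch.nn.functional as F


def fuse_attention(attn_maps):
    """Fuse per-layer multi-head attention into one CLS-to-patch map.

    attn_maps: list of tensors, each (batch, heads, tokens, tokens).
    Returns: (batch, num_patches) relevance of each patch to the CLS token.
    """
    joint = None
    for a in attn_maps:
        a = a.mean(dim=1)                                    # average heads
        a = a + torch.eye(a.size(-1), device=a.device)       # residual connection
        a = a / a.sum(dim=-1, keepdim=True)                  # re-normalise rows
        joint = a if joint is None else torch.bmm(a, joint)  # chain layers
    return joint[:, 0, 1:]  # attention from CLS token to every patch


def crop_and_amplify(images, patch_weights, patch_grid, out_size, keep_ratio=0.5):
    """Crop the most attended square region and upsample it to out_size,
    producing the input for the next recurrent scale."""
    b, _, h, w = images.shape
    gh, gw = patch_grid
    heat = patch_weights.view(b, 1, gh, gw)
    heat = F.interpolate(heat, size=(h, w), mode="bilinear", align_corners=False)
    side = int(min(h, w) * keep_ratio)
    crops = []
    for i in range(b):
        # Centre the crop on the attention peak, clamped to the image border.
        flat_idx = heat[i, 0].flatten().argmax().item()
        cy, cx = divmod(flat_idx, w)
        y0 = max(0, min(h - side, cy - side // 2))
        x0 = max(0, min(w - side, cx - side // 2))
        crop = images[i:i + 1, :, y0:y0 + side, x0:x0 + side]
        crops.append(F.interpolate(crop, size=out_size, mode="bilinear",
                                   align_corners=False))
    return torch.cat(crops, dim=0)
```

In a recurrent setting, the amplified crop produced by `crop_and_amplify` would be fed back through the backbone at the next scale, and a distillation loss between the predictions at different scales could supply the extra supervision the abstract mentions; the exact losses are defined in the paper, not here.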

Keywords