IET Computer Vision (Jun 2023)

Loop and distillation: Attention weights fusion transformer for fine‐grained representation

  • Sun Fayou,
  • Hea Choon Ngo,
  • Zuqiang Meng,
  • Yong Wee Sek

DOI
https://doi.org/10.1049/cvi2.12181
Journal volume & issue
Vol. 17, no. 4
pp. 473 – 482

Abstract

Learning subtle discriminative feature representations plays a significant role in Fine-Grained Visual Categorisation (FGVC). The vision transformer (ViT) achieves promising performance in the traditional image classification field thanks to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture critical feature regions for FGVC because it attends only to the classification token and processes the image in a single pass. Moreover, ViT does not exploit the advantage of fusing attention weights. To better capture vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the most discriminative regions with top priority in a recurrent learning manner. Specifically, the vital regions proposed at each scale are cropped and amplified to form the input of the next scale, so that the most discriminative region is finally located. Furthermore, a distillation learning method is employed to provide better supervision and improve generalisation. RDTrans can be easily trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017.
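The following is a minimal sketch, not the authors' implementation, of the "locate, crop, amplify" loop the abstract describes: fusing the ViT attention weights into a CLS-to-patch relevance map, then cropping the most attended region and resizing it as the input of the next scale. The function names, the attention-rollout-style fusion rule, and the fixed crop ratio are assumptions for illustration only; the paper's actual fusion and region-proposal details may differ.

```python
# Illustrative sketch (PyTorch); assumes a ViT backbone that exposes
# per-layer attention maps with the classification token at index 0.
import torch
import torch.nn.functional as F


def fuse_attention(attn_maps):
    """Fuse per-layer multi-head attention into one CLS-to-patch map.

    attn_maps: list of tensors, each (batch, heads, tokens, tokens).
    Returns: (batch, num_patches) relevance of each patch to the CLS token.
    """
    joint = None
    for a in attn_maps:
        a = a.mean(dim=1)                                    # average heads
        a = a + torch.eye(a.size(-1), device=a.device)       # residual connection
        a = a / a.sum(dim=-1, keepdim=True)                  # re-normalise rows
        joint = a if joint is None else torch.bmm(a, joint)  # chain layers
    return joint[:, 0, 1:]  # attention from CLS token to every patch


def crop_and_amplify(images, patch_weights, patch_grid, out_size, keep_ratio=0.5):
    """Crop the most attended square region and upsample it to out_size,
    producing the input for the next recurrent scale."""
    b, _, h, w = images.shape
    gh, gw = patch_grid
    heat = patch_weights.view(b, 1, gh, gw)
    heat = F.interpolate(heat, size=(h, w), mode="bilinear", align_corners=False)
    side = int(min(h, w) * keep_ratio)
    crops = []
    for i in range(b):
        # Centre the crop on the attention peak, clamped to the image border.
        flat_idx = heat[i, 0].flatten().argmax().item()
        cy, cx = divmod(flat_idx, w)
        y0 = max(0, min(h - side, cy - side // 2))
        x0 = max(0, min(w - side, cx - side // 2))
        crop = images[i:i + 1, :, y0:y0 + side, x0:x0 + side]
        crops.append(F.interpolate(crop, size=out_size, mode="bilinear",
                                   align_corners=False))
    return torch.cat(crops, dim=0)
```

In a recurrent setting, the amplified crop produced by `crop_and_amplify` would be fed back through the backbone at the next scale, and a distillation loss between the predictions at different scales could supply the extra supervision the abstract mentions; the exact losses are defined in the paper, not here.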

Keywords