IEEE Access (Jan 2025)
Hierarchical Feature Attention Learning Network for Detecting Object and Discriminative Parts in Fine-Grained Visual Classification
Abstract
This paper proposes a novel hierarchical feature attention learning network for improved fine-grained visual classification (FGVC). Existing fine-grained classification methods rely heavily on attention mechanisms to differentiate the minute details of similar objects. These mechanisms often assume that critical locations have a similar scale and are uniquely localizable, which is not always accurate. For instance, the size of a bird may vary across images, and the color of its beak may be significant for species identification only in combination with specific wing and tail colors. This paper addresses this limitation with a hierarchical feature attention learning network that first focuses on the target object within the image and then applies multi-headed attention to identify key discriminative locations (patches). In particular, we develop a novel hierarchical attention approach that reduces misleading attention by accounting for the object’s size, so that the correct parts are attended to. In addition, the proposed multi-headed attention examines complementary parts to identify the most discriminative features. Furthermore, our framework is implemented as an architectural constraint and operates in a weakly supervised manner, eliminating the need for object- or part-level annotations. We conducted extensive comparative experiments on three benchmark datasets: CUB-200-2011, NABirds, and Oxford 102 Flower. The results demonstrate that our proposed hierarchical attention approach provides a robust and efficient solution for improved FGVC. Specifically, our method achieved top-1 accuracies of approximately 93.0%, 92.7%, and 99.4% on the CUB-200-2011, NABirds, and Oxford 102 Flower benchmarks, respectively.
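To make the two-stage design described above concrete, the following is a minimal conceptual sketch in PyTorch of object-level attention followed by multi-headed attention over spatial patches. All module names, dimensions, and the placeholder backbone are illustrative assumptions and do not reproduce the authors' actual architecture.

```python
# Minimal sketch of the two-stage hierarchical attention idea (assumed design,
# not the paper's implementation): stage 1 attends to the whole object,
# stage 2 applies multi-head attention over patch tokens of the object region.
import torch
import torch.nn as nn


class ObjectAttention(nn.Module):
    """Stage 1: estimate a soft object mask from backbone features."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> spatial mask in [0, 1]
        mask = torch.sigmoid(self.score(feats))
        return feats * mask  # suppress background, keep the object region


class PartAttention(nn.Module):
    """Stage 2: multi-head attention over spatial patches of the object features."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, H*W, C) patch tokens
        attended, _ = self.attn(tokens, tokens, tokens)  # complementary part cues
        return attended.mean(dim=1)                      # pooled descriptor (B, C)


class HierarchicalAttentionClassifier(nn.Module):
    def __init__(self, channels: int = 256, num_classes: int = 200):
        super().__init__()
        # Placeholder backbone; a pretrained CNN or transformer would be used in practice.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        self.object_attn = ObjectAttention(channels)
        self.part_attn = PartAttention(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)       # (B, C, H, W)
        feats = self.object_attn(feats)     # first attend to the whole object
        descriptor = self.part_attn(feats)  # then to discriminative parts
        return self.head(descriptor)


if __name__ == "__main__":
    model = HierarchicalAttentionClassifier()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 200])
```

Because only image-level labels supervise the classifier head, both attention stages in this sketch are trained indirectly, consistent with the weakly supervised setting described in the abstract.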
Keywords