End-to-End: A Simple Template for the Long-Tailed-Recognition of Transmission Line Clamps via a Vision-Language Model

Fei Yan; Hui Zhang; Yaogen Li; Yongjia Yang; Yinping Liu

doi:10.3390/app13053287

Applied Sciences (Mar 2023)

Affiliations

Fei Yan: College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China
Hui Zhang: College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China
Yaogen Li: College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China
Yongjia Yang: College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China
Yinping Liu: College of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China

DOI: https://doi.org/10.3390/app13053287
Journal volume & issue: Vol. 13, no. 5, p. 3287

Abstract

Real-world image classification datasets generally follow a long-tailed distribution. This poses a substantial problem for standard classification algorithms: most samples belong to only a few categories, so the loss function biases learning toward the dominant labels. Existing systems typically improve performance in two stages: pretraining on the original imbalanced dataset, then fine-tuning on a balanced dataset via re-sampling or logit adjustment. These approaches have achieved promising results. However, their limited self-supervised information makes it difficult to transfer them to other vision tasks, such as detection and segmentation. Using large-scale contrastive vision-language pretraining, the OpenAI team introduced a novel visual recognition method. We propose a simple one-stage model, the text-to-image network (TIN), for long-tailed recognition (LTR) based on the similarities between textual and visual features. TIN has the following advantages over existing techniques: (1) The model incorporates both textual and visual semantic information. (2) The end-to-end strategy achieves good results with fewer image samples and no secondary training. (3) By using seesaw loss, we further reduce the loss gap between the head and tail categories. These adjustments encourage large relative magnitudes between the logits of rare and dominant labels. We compared TIN extensively with a large number of advanced models on ImageNet-LT, the largest long-tailed public dataset, where it achieved state-of-the-art results for a single-stage model with 72.8% Top-1 accuracy.
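The idea of classifying by similarity between textual and visual features can be illustrated with a minimal sketch. It is not the authors' TIN implementation. It only shows the general contrastive vision-language scoring scheme, in which an image embedding is compared with one text embedding per class and the most similar class is chosen. The toy feature vectors, the placeholder class names, the helper names, and the temperature value of 0.07 are all assumptions made for illustration. In a real system the embeddings would come from pretrained image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_logits(image_feat, text_feats, temperature=0.07):
    """Cosine similarity between one image embedding and each class text embedding,
    divided by a temperature (0.07 is an assumed value, not taken from the paper)."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    return logits

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings; real ones would be produced by the two encoders.
class_names = ["class_a", "class_b"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_feat = [0.2, 0.9, 0.1]

probs = softmax(similarity_logits(image_feat, text_feats))
predicted = class_names[probs.index(max(probs))]
print(predicted, probs)
```

Because every class is represented by a text embedding rather than by a classifier weight learned only from image counts, a rare class still receives a meaningful score. A loss such as seesaw loss would then be applied on top of logits like these during training; that step is not shown here.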

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
