Egyptian Informatics Journal (Sep 2024)

YOLO-HyperVision: A vision transformer backbone-based enhancement of YOLOv5 for detection of dynamic traffic information

  • Shizhou Xu,
  • Mengjie Zhang,
  • Jingyu Chen,
  • Yiming Zhong

Journal volume & issue
Vol. 27
p. 100523

Abstract

As traffic flow in modern urban areas increases, traffic congestion has become a serious problem that disrupts people's daily life and work. Replacing manual monitoring with object detection makes it possible to assess road conditions quickly and to provide timely information about traffic flow. However, when drones observe traffic from the air, perspective effects make the detected vehicles and pedestrians very small, and the scale differences between target categories are large, which makes detection harder for a single convolutional neural network (CNN) model. To address the low accuracy of traditional single-stage object detection models, this study proposes You Only Look Once-HyperVision (YOLO-HV), an improved YOLOv5 vehicle detection model with a Vision Transformer (ViT) backbone. YOLO-HV targets the poor multi-scale recognition performance caused by the inability of traditional CNNs to integrate contextual information, and is intended to help drones recognize traffic flow more efficiently and accurately. The model deeply integrates the ViT backbone with a CNN, combining the multi-scale detection advantages of the Vision Transformer with the inductive bias of the CNN, and adds multi-scale residual modules and context correlation enhancement modules, which greatly improves the recognition accuracy of single-stage detectors on drone images. Comparative experiments on the VisDrone dataset show that the model outperforms several commonly used detection models: YOLO-HV increases the mean average precision (mAP) by 3.3% compared with a pure convolutional network of the same model size. The YOLO-HV model achieves excellent performance on traffic flow detection in drone imagery and identifies and classifies road vehicles more accurately than the other object detection models considered.
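
The results above are reported as mean average precision (mAP). As a reminder of what that metric measures, the following is a minimal sketch of one common way to compute it (all-point interpolated average precision per class, then averaged over classes). It assumes detections have already been matched to ground-truth boxes, for example by an IoU threshold; the function names and the toy inputs are hypothetical and are not taken from the paper, whose exact evaluation protocol is not stated in this abstract.

```python
from typing import Dict, List, Tuple

# One detection: (confidence score, whether it matched a ground-truth box).
Detection = Tuple[float, bool]


def average_precision(detections: List[Detection], num_ground_truth: int) -> float:
    """All-point interpolated AP for a single class (illustrative helper)."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve as a sum of rectangles.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[Detection], int]]) -> float:
    """Mean of per-class AP values; per_class maps name -> (detections, #ground truth)."""
    aps = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data for two hypothetical classes (not from the VisDrone experiments).
toy = {
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "pedestrian": ([(0.6, True)], 1),
}
toy_map = mean_average_precision(toy)
```

Under this definition, a model that ranks correct detections ahead of false alarms across all classes scores higher, which is why the metric is sensitive to the small, multi-scale targets the abstract describes.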

Keywords