Vision Transformer-Based Tailing Detection in Videos

Jaewoo Lee; Sungjun Lee; Wonki Cho; Zahid Ali Siddiqui; Unsang Park

doi:10.3390/app112411591

Applied Sciences (Dec 2021)

Affiliations

Jaewoo Lee: Department of Computer Science and Engineering, Sogang University, Mapo-gu, Seoul 04107, Korea
Sungjun Lee: Department of Computer Science and Engineering, Sogang University, Mapo-gu, Seoul 04107, Korea
Wonki Cho: Department of Computer Science and Engineering, Sogang University, Mapo-gu, Seoul 04107, Korea
Zahid Ali Siddiqui: Department of Computer Science and Engineering, Sogang University, Mapo-gu, Seoul 04107, Korea
Unsang Park: Department of Computer Science and Engineering, Sogang University, Mapo-gu, Seoul 04107, Korea

DOI: https://doi.org/10.3390/app112411591
Journal volume & issue: Vol. 11, no. 24, p. 11591

Abstract

Tailing is defined as an event in which a suspicious person closely follows someone. We frame tailing detection in videos as an anomaly detection problem, where the goal is to find abnormalities in the walking patterns of the pedestrians (victim and follower). We therefore propose a modified Time-Series Vision Transformer (TSViT), a method for video anomaly detection aimed specifically at tailing detection with a small dataset. We introduce an effective way to train TSViT on a small dataset by regularizing the prediction model. To do so, we first encode the spatial information of the pedestrians into 2D patterns and then pass them as tokens to the TSViT. Through a series of experiments, we show that tailing detection on a small dataset using TSViT outperforms popular CNN-based architectures, because CNN architectures tend to overfit on a small dataset of time-series images. We also show that, on time-series images, the performance of CNN-based architectures gradually drops as network depth is increased to raise capacity. In contrast, a Vision Transformer with a reduced number of attention heads performs well on time-series images, and its performance improves further as the input image resolution increases. Experimental results demonstrate that TSViT performs better than both the handcrafted rule-based method and the CNN-based method for tailing detection. TSViT can be used in many video anomaly detection applications, even with a small dataset.
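
The encoding step mentioned in the abstract (pedestrian positions turned into 2D patterns, then passed to the transformer as tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the rasterization scheme, the grid size, the patch size, and the names `encode_tracks` and `to_patch_tokens` are all assumptions made for illustration, and the toy trajectories are synthetic.

```python
import numpy as np

def encode_tracks(tracks, size=32):
    """Rasterize pedestrian tracks into one 2D image (hypothetical encoding).

    tracks: array of shape (T, P, 2) holding (x, y) in [0, 1] for P
    pedestrians over T frames. Later frames are drawn brighter, so the
    image keeps a trace of the walking direction.
    """
    num_frames = tracks.shape[0]
    image = np.zeros((size, size), dtype=np.float32)
    cells = np.clip((tracks * size).astype(int), 0, size - 1)
    for t in range(num_frames):
        for x, y in cells[t]:
            image[y, x] = max(image[y, x], (t + 1) / num_frames)
    return image

def to_patch_tokens(image, patch=8):
    """Split an (H, W) image into flattened, non-overlapping patch tokens."""
    height, width = image.shape
    assert height % patch == 0 and width % patch == 0
    grid = image.reshape(height // patch, patch, width // patch, patch)
    # Reorder to (patch_row, patch_col, row_in_patch, col_in_patch).
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

# Toy example (synthetic data): a victim and a follower walking left to right.
steps = 10
victim = np.stack([np.linspace(0.3, 0.9, steps), np.full(steps, 0.50)], axis=1)
follower = np.stack([np.linspace(0.1, 0.7, steps), np.full(steps, 0.55)], axis=1)
tracks = np.stack([victim, follower], axis=1)  # shape (T, P, 2)

image = encode_tracks(tracks)
tokens = to_patch_tokens(image)  # one row per patch, ready for a linear embedding
```

In a Vision Transformer, each row of `tokens` would then be linearly projected and combined with a positional embedding before entering the attention layers; those stages are omitted here.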

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
