IET Computer Vision (Apr 2023)

MTSCANet: Multi temporal resolution temporal semantic context aggregation network

  • Haiping Zhang,
  • Conghao Ma,
  • Dongjin Yu,
  • Liming Guan,
  • Dongjing Wang,
  • Zepeng Hu,
  • Xu Liu

DOI
https://doi.org/10.1049/cvi2.12163
Journal volume & issue
Vol. 17, no. 3
pp. 366–378

Abstract


Temporal action localisation is a challenging task, and video context is crucial to localising actions. Most existing methods that incorporate temporal and semantic contexts into video features suffer from a single contextual representation and blurred temporal boundaries. In this study, a multi-temporal resolution pyramid structure model is proposed. Firstly, a temporal-semantic context aggregation module (TSCF) is designed to assign different attention weights to temporal contexts and combine them with multi-level semantics into video features. Secondly, to address the large differences in time span between different actions in a video, a local-global attention module is designed that combines local and global temporal dependencies at each temporal point, yielding a more flexible and robust representation of contextual relations. The redundant representation of the convolution kernel is reduced by modifying the convolution, and computational capacity is redeployed at a fine granularity. To verify the effectiveness of the model, extensive experiments are performed on three challenging datasets. On THUMOS14, the best performance is obtained in [email protected]–0.6 with an average mAP of 47.02%. On ActivityNet-1.3, an average mAP of 34.94% is obtained, and on HACS, an average mAP of 28.46% is achieved.
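The local-global idea described above can be illustrated with a minimal NumPy sketch: each temporal point attends once over a restricted local window and once over the whole sequence, and the two branches are blended. This is not the paper's implementation; the function name, the `window` half-width, and the `alpha` mixing weight are all hypothetical illustration choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_attention(feats, window=3, alpha=0.5):
    """Blend local (windowed) and global self-attention over time.

    feats:  (T, C) array of per-timestep video features.
    window: half-width of the local temporal neighbourhood (hypothetical).
    alpha:  mixing weight between the two branches (hypothetical).
    """
    T, C = feats.shape
    scores = feats @ feats.T / np.sqrt(C)          # (T, T) pairwise similarity

    # Global branch: every temporal point attends to all others.
    global_out = softmax(scores, axis=-1) @ feats

    # Local branch: mask out points beyond the window before the softmax,
    # so short actions are not drowned out by distant context.
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    local_scores = np.where(mask, scores, -np.inf)
    local_out = softmax(local_scores, axis=-1) @ feats

    return alpha * local_out + (1 - alpha) * global_out
```

Blending the two branches per temporal point is what gives each location both a short-range and a sequence-wide view, which is the flexibility the abstract attributes to the module.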

Keywords