Applied Sciences (Jan 2024)

Audio-Visual Action Recognition Using Transformer Fusion Network

  • Jun-Hwa Kim,
  • Chee Sun Won

DOI: https://doi.org/10.3390/app14031190
Journal volume & issue: Vol. 14, No. 3, p. 1190

Abstract

Our approach to action recognition is grounded in the intrinsic coexistence and complementarity of audio and visual information in videos. Going beyond the traditional emphasis on visual features, we propose a transformer-based network that takes both audio and visual data as inputs. The network is designed to accept and process spatial, temporal, and audio modalities. Features from each modality are extracted using a single Swin Transformer, originally devised for still images. These extracted spatial, temporal, and audio features are then combined by a novel modal fusion module (MFM). By fusing the three modalities, our transformer-based network provides a robust solution for action recognition.
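To make the three-stream design concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: one shared backbone extracts tokens from the spatial, temporal, and audio inputs, and a fusion stage combines them before classification. The abstract does not specify the MFM's internals, the Swin configuration, or the input layouts, so the stand-in convolutional backbone, the treatment of each input as a 3-channel image-like map, and the token-concatenation-plus-self-attention fusion below are all assumptions for illustration, not the paper's implementation.

# Sketch only: Backbone stands in for the shared Swin Transformer, and the
# concatenation + self-attention fusion is an assumed placeholder for the MFM.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for the single shared Swin Transformer feature extractor."""
    def __init__(self, in_ch: int = 3, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                     # (B, dim, N)
        )

    def forward(self, x):                   # x: (B, C, H, W)
        return self.net(x).transpose(1, 2)  # (B, N, dim) token sequence

class FusionNet(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 400):
        super().__init__()
        # One backbone is shared across all three modalities, matching the
        # abstract's "single Swin Transformer".
        self.backbone = Backbone(in_ch=3, dim=dim)
        # Assumed fusion: joint self-attention over the concatenated tokens.
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, spatial, temporal, audio):
        # Each input is assumed to be a 3-channel image-like map: an RGB
        # frame, a stacked motion representation, and an audio spectrogram.
        tokens = torch.cat([self.backbone(spatial),
                            self.backbone(temporal),
                            self.backbone(audio)], dim=1)
        fused = self.fuse(tokens)            # attention across all modalities
        return self.head(fused.mean(dim=1))  # pooled logits per action class

model = FusionNet()
rgb = torch.randn(2, 3, 224, 224)    # spatial stream
motion = torch.randn(2, 3, 224, 224) # temporal stream (e.g., frame diffs)
spec = torch.randn(2, 3, 224, 224)   # audio stream (spectrogram as image)
print(model(rgb, motion, spec).shape)  # torch.Size([2, 400])

The key design point the sketch preserves is weight sharing: all three modalities pass through the same feature extractor, so only the fusion stage and classifier are modality-aware.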

Keywords