Spectral Representation Learning and Fusion for Autonomous Vehicles Trip Description Exploiting Recurrent Transformer

Ghazala Rafiq; Muhammad Rafiq; Gyu Sang Choi

doi:10.1109/ACCESS.2023.3287783

IEEE Access (Jan 2023)

Affiliations

Ghazala Rafiq: Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Republic of Korea
Muhammad Rafiq: Department of Game and Mobile Engineering, Keimyung University, Daegu, Republic of Korea
Gyu Sang Choi: Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3287783
Journal volume & issue: Vol. 11, pp. 61437–61452

Abstract

An ideal video description model must analyze and comprehend the entire set of cues in visual data. Recent algorithms have generated video descriptions primarily by learning RGB and optical flow representations, rather than exploring and incorporating the media's spectral components, that is, the patterns or characteristics in the distribution of colors or intensities across different frequencies or wavelengths of light. These components may enhance description quality and improve the accuracy, diversity, and coherence of the generated text. To fill this research gap, we propose a novel Fourier-based algorithm that extracts spectral features from the 3D visual volume by decomposing the video signal into its frequency components. The captured spectral features are then fused with learned spatial and temporal representations in a recurrent transformer architecture for accurate content understanding and appropriate description generation in natural language. The transformer includes an external memory module that produces summarized memory states based on the history of previously observed video fragments and already-generated sentences. These memory states ensure that sound semantic and linguistic cues are established. As a result, the proposed algorithm integrates spatial, temporal, spectral, and semantic representations to produce precise and grammatically accurate descriptions. Qualitative and quantitative experiments on the DeepRide driving trip description dataset demonstrate the effectiveness of the algorithm for coherent and diverse video description. A comprehensive ablation study validates the efficacy of fusing spectral features with spatial and temporal visual representations for rich video-to-text narration.
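
The abstract does not specify how the Fourier-based extraction or the fusion is implemented, so the following is only an illustrative sketch of the general idea: take a 3D FFT over a clip's temporal and spatial axes, summarize the power spectrum into a few frequency bands, and concatenate the result with other feature vectors. The function names `spectral_features` and `fuse`, the radial band pooling, the grayscale input, and the concatenation-based fusion are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def spectral_features(clip, n_bands=8):
    """Hypothetical helper: log band energies of a (T, H, W) grayscale clip."""
    # Decompose the video volume into frequency components along
    # time, height, and width in a single 3D FFT.
    spectrum = np.fft.fftn(clip, axes=(0, 1, 2))
    power = np.abs(spectrum) ** 2

    # Normalized frequency coordinate of every FFT coefficient.
    ft = np.fft.fftfreq(clip.shape[0])[:, None, None]
    fy = np.fft.fftfreq(clip.shape[1])[None, :, None]
    fx = np.fft.fftfreq(clip.shape[2])[None, None, :]
    radius = np.sqrt(ft ** 2 + fy ** 2 + fx ** 2)

    # Assumption: pool the power into n_bands equal-width radial bands.
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, n_bands - 1)
    energy = np.bincount(band.ravel(), weights=power.ravel(),
                         minlength=n_bands)

    # Log compression keeps the low-frequency bands from dominating.
    return np.log1p(energy)


def fuse(spatial, temporal, spectral):
    """Hypothetical fusion: concatenate the three feature vectors."""
    return np.concatenate([spatial, temporal, spectral], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((16, 32, 32))      # stand-in for 16 grayscale frames
    spatial = rng.random(128)            # stand-in for RGB features
    temporal = rng.random(128)           # stand-in for optical flow features
    fused = fuse(spatial, temporal, spectral_features(clip))
    print(fused.shape)
```

In a real pipeline the fused vector would be one input token to the recurrent transformer; the stand-in random arrays here only show how the shapes combine.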

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
