IEEE Access (Jan 2024)
The Visual Saliency Transformer Goes Temporal: TempVST for Video Saliency Prediction
Abstract
The Transformer revolutionized Natural Language Processing and Computer Vision by effectively capturing contextual relationships in sequential data through its attention mechanism. While Transformers have been explored extensively in traditional computer vision tasks such as image classification, their application to more intricate tasks, such as Video Saliency Prediction (VSP), remains limited. Video saliency prediction is the task of identifying the most visually salient regions in a video, i.e., those most likely to capture a viewer's attention. In this study, we propose a pure Transformer architecture named Temporal Visual Saliency Transformer (TempVST) for the VSP task. Our model uses the Visual Saliency Transformer (VST) as a backbone and adds a Transformer-based temporal module that, by incorporating temporal recurrences, allows diverse image-domain architectures to be carried over seamlessly to the video domain. Moreover, we demonstrate that transfer learning is viable for VSP with Transformer architectures and shortens training, reducing the duration of the training phase by 41% and 45% on two different datasets. Our experiments were conducted on two benchmark datasets, DHF1K and LEDOV, and the results show that our network is competitive with state-of-the-art models.
Keywords