Results in Engineering (Dec 2024)
RoadSitu: Leveraging road video frame extraction and three-stage transformers for situation recognition
Abstract
Situation recognition (SR) is a crucial problem in scene understanding, activity understanding, and action reasoning, as it provides a structured representation of the main activity depicted in an image. Semantic role labeling is central to SR and is challenging because a single action can have multiple meanings and purposes depending on its context. Understanding images beyond the highlighted actions requires inferences about the context of the scene, the objects, and the roles they play in the captured event. Recently, SR has been extended to moving images, jointly deriving a collection of action (activity), semantic-role, and noun (entity) pairs from video frames. To label these frames as action frames, nouns (entities) must be assigned to roles based on the content of the observed image. One of the main challenges is managing the complex dependencies between the assigned roles (nouns) and the predicted action, as correct role assignment often depends on the accuracy of the action prediction. We introduce RoadSitu, a road situation recognition model that generates a structured summary of what is happening in a road scenario, consisting of an action and the semantic roles played by agents in a video frame. An action can describe a diverse set of situations, and the same agent can play different roles depending on the situation depicted in the frame; a situation recognition model must therefore understand the context of each video frame and the visual-linguistic meaning of the semantic roles in that frame. Further difficulties arise from the complexity of annotating video frames with semantic roles, the structured dependencies between the assigned roles (nouns) and the predicted action (activity), and the sparsity of meaningful semantic information in road scenarios. To overcome these challenges, we introduce a novel approach in which action recognition and noun estimation interact to form structured summaries of each situation. In experiments on a road video dataset obtained from a South Korean company, RoadSitu achieved significant improvements across performance metrics, with a Top-1 verb accuracy of 43.46%, a Top-5 verb accuracy of 72.48%, and a value accuracy of 34.21%, outperforming baseline models such as GSRTR and JSL by 2.4% and 3.86% in Top-1 verb accuracy, respectively. These results demonstrate the effectiveness of our model in handling complex road scenarios.
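To make the structured output concrete, the following minimal Python sketch shows what a single predicted situation frame (an action plus noun assignments for its semantic roles) could look like; the verb, role names, and nouns here are hypothetical illustrations, not values from the paper's dataset or model.

    from dataclasses import dataclass, field

    @dataclass
    class SituationFrame:
        # Structured summary of one video frame: an action (verb)
        # plus noun (entity) assignments for its semantic roles.
        verb: str                                             # predicted action
        roles: dict[str, str] = field(default_factory=dict)   # role -> noun

    # Hypothetical road situation: a pedestrian crossing at a crosswalk.
    frame = SituationFrame(
        verb="crossing",
        roles={"agent": "pedestrian", "place": "crosswalk", "obstacle": "vehicle"},
    )
    print(frame.verb, frame.roles)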