Results in Engineering (Dec 2024)
RoadSitu: Leveraging road video frame extraction and three-stage transformers for situation recognition
Abstract
Situation recognition (SR) is a crucial problem in scene understanding, activity understanding, and action reasoning, as it provides a structured representation of the main activity depicted in an image. Semantic role labeling is central to SR and is challenging because a single action can have multiple meanings and purposes depending on its context. Understanding images beyond the highlighted actions requires inferences about the context of the scene, the objects, and the roles they play in the captured event. Recently, SR has been extended to moving images, jointly deriving a collection of action (activity), semantic-role, and noun (entity) pairs from video frames. To label these frames as action frames, nouns (entities) must be assigned to roles based on the content of the observed image. One of the main challenges is managing the complex dependencies between the assigned roles (nouns) and the predicted action, as correct role assignment often depends on the accuracy of the action prediction. We introduce RoadSitu, a road situation recognition model that generates a structured summary of what is happening in a road scenario, consisting of an action and the semantic roles played by agents in a video frame. An action can describe a diverse set of situations, and the same agent can play different roles depending on the situation depicted in the frame; a situation recognition model must therefore understand the context of each video frame and the visual-linguistic meaning of the semantic roles in that frame. Further difficulties arise from the complexity of annotating video frames with semantic roles, the structured dependencies between the assigned roles (nouns) and the predicted action (activity), and the sparsity of meaningful semantic information in road scenarios. To overcome these challenges, we introduce a novel approach in which action recognition and noun estimation interact to form structured summaries of each situation. In experiments on a road video dataset obtained from a South Korean company, RoadSitu achieved significant improvements across performance metrics, with a Top-1 verb accuracy of 43.46%, a Top-5 verb accuracy of 72.48%, and a value accuracy of 34.21%, outperforming baseline models such as GSRTR and JSL by 2.4% and 3.86% in Top-1 verb accuracy, respectively. These results demonstrate the effectiveness of our model in handling complex road scenarios.
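To make the structured output concrete, the following minimal Python sketch shows what a single predicted situation frame (an action plus noun assignments for its semantic roles) could look like; the verb, role names, and nouns here are hypothetical illustrations, not values from the paper's dataset or model.

    from dataclasses import dataclass, field

    @dataclass
    class SituationFrame:
        # Structured summary of one video frame: an action (verb)
        # plus noun (entity) assignments for its semantic roles.
        verb: str                                             # predicted action
        roles: dict[str, str] = field(default_factory=dict)   # role -> noun

    # Hypothetical road situation: a pedestrian crossing at a crosswalk.
    frame = SituationFrame(
        verb="crossing",
        roles={"agent": "pedestrian", "place": "crosswalk", "obstacle": "vehicle"},
    )
    print(frame.verb, frame.roles)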