Applied Sciences (May 2022)
Multi-Head TrajectoryCNN: A New Multi-Task Framework for Action Prediction
Abstract
Action prediction is an important task in human activity analysis, which has many practical applications, such as human–robot interactions and autonomous driving. Action prediction often comprises two subtasks: action semantic prediction and future human motion prediction. Most of the existing works treat these subtasks separately, ignoring the correlations, leading to unsatisfying performance. By contrast, we jointly model these tasks and improve human motion predictions utilizing their action semantics. In terms of methodology, we propose a novel multi-task framework (Multi-head TrajectoryCNN) to simultaneously predict the action semantics and human motion of future human movements. Specifically, we first extract a general spatiotemporal representation of partial observations via two regression blocks. Then, we propose a regression head and a classification head for predicting future human motion and action semantics of human motion, respectively. For the regression head, another two stacked regression blocks and two convolutional layers are applied to predict future poses from the general representation learning. For the classification head, we propose a classification block and stack two regression blocks to predict action semantics from the general representation. In this way, the regression and classification heads are incorporated into a unified framework. During the backward propagation of the network, the human motion prediction and the semantic prediction may be enhanced by each other. NTU RGB+D is a widely used large-scale dataset for action recognition, which was collected by 40 different subjects from three views. Based on the official protocols, we use the skeletal modality and process action sequences with fixed lengths for the evaluation of our action prediction task. Experiments on NTU RGB+D show our model’s state-of-the-art performance. Furthermore, the experimental results also show that semantic information is of great help in predicting future human motion.
Keywords