IEEE Access (Jan 2019)

Learning Coarse and Fine Features for Precise Temporal Action Localization

  • Ji-Hwan Kim,
  • Jae-Pil Heo

DOI: https://doi.org/10.1109/ACCESS.2019.2946898
Journal volume & issue: Vol. 7, pp. 149797–149809

Abstract

Temporal action localization in untrimmed videos is a fundamental task for real-world computer vision applications such as video surveillance systems. Even though a great deal of research attention has been paid to the problem, precise localization of human activities at the frame level still remains a challenge. In this paper, we propose CoarseFine networks that learn highly discriminative features without loss of time granularity through two streams: the coarse and fine networks. The coarse network aims to classify the action category based on the global context of a video, taking advantage of the descriptive power of successful action recognition models. In contrast, the fine network deploys no temporal pooling and operates with a low channel capacity; it is specialized to identify the per-frame location of actions based on local semantics. This approach enables CoarseFine networks to learn fine-grained representations without any temporal information loss. Our extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-v1.3, validate that our proposed method outperforms the state-of-the-art by a remarkable margin in per-frame labeling and temporal action localization tasks while significantly reducing the computational cost.
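To make the two-stream design concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract, not the authors' implementation: a coarse stream that pools over time and space to classify the action at the video level, and a fine stream that keeps the temporal axis intact with a small channel width to score actions per frame. All module names, layer sizes, and the width parameter are illustrative assumptions.

import torch
import torch.nn as nn


class CoarseStream(nn.Module):
    """Classifies the action category from globally pooled video features."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.backbone = nn.Conv3d(in_channels, 256, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)            # pools time and space
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):                              # x: (B, C, T, H, W)
        feat = torch.relu(self.backbone(x))
        return self.classifier(self.pool(feat).flatten(1))   # (B, num_classes)


class FineStream(nn.Module):
    """Scores actions per frame; no temporal pooling, low channel capacity."""
    def __init__(self, in_channels: int, num_classes: int, width: int = 32):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, width, kernel_size=3, padding=1)
        self.spatial_pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keeps T intact
        self.frame_head = nn.Conv1d(width, num_classes, kernel_size=1)

    def forward(self, x):                               # x: (B, C, T, H, W)
        feat = torch.relu(self.conv(x))
        feat = self.spatial_pool(feat).squeeze(-1).squeeze(-1)  # (B, width, T)
        return self.frame_head(feat)                     # (B, num_classes, T)


# Usage: a batch of two 64-frame RGB clips at 112x112 resolution.
video = torch.randn(2, 3, 64, 112, 112)
coarse_logits = CoarseStream(3, 20)(video)   # (2, 20): one label per video
frame_logits = FineStream(3, 20)(video)      # (2, 20, 64): a score per frame

The key contrast is visible in the pooling: the coarse stream collapses the temporal dimension for a global prediction, whereas the fine stream pools only spatially, so its output retains one score vector per input frame.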

Keywords