IEEE Access (Jan 2021)

Semi-Supervised Temporal Segmentation of Manufacturing Work Video by Automatically Building a Hierarchical Tree of Category Labels

  • Kazuaki Nakamura
  • Naoko Nitta
  • Noboru Babaguchi
  • Kensuke Fujii
  • Satoki Matsumura
  • Eiji Nabata

DOI: https://doi.org/10.1109/ACCESS.2021.3076849
Journal volume & issue: Vol. 9, pp. 68017–68027

Abstract

Nowadays, many industrial companies visually record workers’ activities in order to streamline their work processes. However, untrimmed raw videos are hard to use, so it is desirable to automatically divide them into segments and recognize which kind of operation is performed in each segment. This task is called temporal video segmentation. We propose a method for achieving it, particularly targeting videos of manufacturing work performed with a specialized vehicle such as a hydraulic excavator. Extracting good visual features from the input videos is essential for high segmentation performance. Unsupervised methods can hardly achieve this, whereas supervised methods have the drawback that collecting a sufficient amount of training data is labor-intensive. To overcome both drawbacks, the proposed method employs a semi-supervised approach. We assume that a set of weakly labeled videos, in which only a sparse subset of frames carries a category label, is given as input; the labeled frames serve as training data for a desirable feature extractor. Under this assumption, the proposed method first divides the input videos into fixed-length short segments called primitive segments and then clusters them using visual features extracted by the above feature extractor. To achieve higher performance, we also use a hierarchical tree of the category labels and recursively perform the above process at each branch of the tree, where the tree is built automatically by the proposed method. In our experiments, we achieved a segmentation performance of 0.947 in F-measure, even when only 1.25% of all frames in the input videos were labeled.
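The pipeline sketched in the abstract (fixed-length primitive segments, clustered by learned visual features, with clustering applied recursively down a label hierarchy) can be illustrated in miniature as follows. This is only a rough sketch under stated assumptions: the paper trains its feature extractor on the sparsely labeled frames, whereas the features below are random placeholders; k-means with a hand-picked cluster count stands in for whatever clustering the authors actually use; and `make_primitive_segments`, `recursive_cluster`, the segment length, and the hand-written label tree are all hypothetical (the paper builds its tree automatically).

```python
import numpy as np
from sklearn.cluster import KMeans

def make_primitive_segments(frame_features, seg_len):
    """Average per-frame features over fixed-length windows,
    yielding one feature vector per primitive segment."""
    n_full = (len(frame_features) // seg_len) * seg_len
    windows = frame_features[:n_full].reshape(-1, seg_len, frame_features.shape[1])
    return windows.mean(axis=1)

def recursive_cluster(seg_features, tree):
    """Cluster segments among the branches at the root of a label tree,
    then recurse into each child subtree with only the segments assigned
    to that branch. `tree` is a nested dict: {branch_name: subtree_or_None}."""
    branches = list(tree)
    if len(branches) < 2 or len(seg_features) < len(branches):
        # Nothing left to split: assign everything to the first branch.
        return [branches[0]] * len(seg_features) if branches else []
    ids = KMeans(n_clusters=len(branches), n_init=10,
                 random_state=0).fit_predict(seg_features)
    labels = [None] * len(seg_features)
    for b_idx, name in enumerate(branches):
        members = np.where(ids == b_idx)[0]
        subtree = tree[name]
        if subtree:  # internal node: refine within this branch
            for i, lab in zip(members, recursive_cluster(seg_features[members], subtree)):
                labels[i] = lab
        else:        # leaf: the branch name is the final category
            for i in members:
                labels[i] = name
    return labels

# Toy run: 1000 frames of 128-d placeholder features, 25-frame primitive
# segments, and a two-level label tree written by hand for illustration.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 128))
segs = make_primitive_segments(frames, seg_len=25)
tree = {"digging": {"scoop": None, "dump": None}, "moving": None}
print(recursive_cluster(segs, tree)[:8])
```

The point of the recursion matches the intuition the abstract gives for the hierarchical scheme: each clustering step only has to separate a few coarse categories at a time, rather than all fine-grained operation labels at once.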

Keywords