Complexity (Jan 2020)

Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter

  • Yidi Li,
  • Hong Liu,
  • Bing Yang,
  • Runwei Ding,
  • Yang Chen

DOI
https://doi.org/10.1155/2020/3764309
Journal volume & issue
Vol. 2020

Abstract

Read online

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. The current challenges are focused on the construction of a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning on the two-layer particle filter framework. Firstly, the audio-guided motion model is applied to generate candidate samples in the hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is proposed with a designed Siamese network, which provides the similarity-based likelihood to calculate particle weights. The speaker position is estimated using an optimal particle set, which integrates the decisions from audio particles and visual particles. Finally, the long short-term mechanism-based template update strategy is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms the single-modal trackers and comparison methods. Efficient and robust tracking is achieved both in 3D space and on image plane.