Machine Learning with Applications (Dec 2024)
VLFSE: Enhancing visual tracking through visual language fusion and state update evaluator
Abstract
Recently, visual tracking algorithms have achieved impressive results by incorporating dynamic templates. However, the instability of visual images and poorly timed template updates reduce tracking accuracy and stability in complex scenarios. To address these issues, we propose VLFSE, a visual tracking algorithm based on visual language fusion and a state update evaluator. Specifically, our approach introduces a multimodal attention mechanism that uses self-attention to mine and integrate information from the visual and language modalities. This mechanism yields a richer, context-aware representation of the target, enabling more accurate tracking even in complex scenes. Moreover, precise template updates are critical for maintaining tracking accuracy over time. To this end, we develop a state update evaluator, a component trained online to assess accurately whether and when the template should be updated. The evaluator acts as a safeguard, preventing erroneous updates and allowing the tracker to adapt to changes in the target’s appearance. Experimental results on challenging visual language tracking datasets demonstrate our tracker’s superior performance, adaptability, and accuracy in complex tracking scenarios.
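
The two ideas summarized above can be illustrated with a minimal sketch. It is not the VLFSE implementation: the function names, the identity query/key/value projections, the scalar evaluator score, and the 0.5 threshold are all assumptions made for illustration. The sketch only shows how self-attention over concatenated visual and language tokens lets each modality attend to the other, and how an evaluator score could gate a template update.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity projections stand in for learned Q/K/V weights.
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        fused.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


def fuse(visual_tokens, language_tokens):
    # Concatenate both modalities so every token attends to every other token.
    return self_attention(visual_tokens + language_tokens)


def maybe_update_template(template, candidate, score, threshold=0.5):
    # Hypothetical gate: accept the candidate template only when the
    # evaluator's score reaches the threshold; otherwise keep the old one.
    return candidate if score >= threshold else template


visual = [[1.0, 0.0], [0.0, 1.0]]   # toy visual tokens
language = [[1.0, 1.0]]             # toy language token
fused = fuse(visual, language)

kept = maybe_update_template("old", "new", score=0.2)
replaced = maybe_update_template("old", "new", score=0.9)
```

In this sketch the evaluator is reduced to a score passed in by the caller; in the tracker described above it is a component trained online, so the way that score is produced is left open here.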