IEEE Access (Jan 2023)
Video Relationship Detection Using Mixture of Experts
Abstract
Machine comprehension of visual information from images and videos by neural networks suffers from two limitations: (1) the computational and inference gap between vision and language, which makes it difficult to accurately determine which object a given agent acts upon and then to represent that relationship in language, and (2) the limited stability and generalization of classifiers trained as single, monolithic neural networks. To address these limitations, we propose MoE-VRD, a novel approach to visual relationship detection via a mixture of experts. MoE-VRD identifies language triplets in the form of ⟨subject, predicate, object⟩ tuples, extracting the relationships among subject, predicate, and object from visual processing. Since detecting a relationship between a subject (acting) and an object (being acted upon) requires that the action be recognized, we base our network on recent work in visual relationship detection. To overcome the limitations of single monolithic networks, our mixture of experts aggregates the outputs of multiple small models: each expert in MoE-VRD is a visual relationship learner capable of detecting and tagging objects. MoE-VRD thus employs an ensemble of networks while preserving the complexity and computational cost of the original underlying visual relationship model, by applying a sparsely-gated mixture of experts that allows for conditional computation and a significant gain in neural network capacity. We show that the conditional computation capabilities and scalability of the mixture-of-experts architecture lead to an approach to visual relationship detection that outperforms the state of the art.
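To make the sparse gating concrete, the following is a minimal sketch of a sparsely-gated mixture-of-experts layer in the spirit of the conditional computation described above. The module name, expert architecture, and sizes are illustrative assumptions, not the authors' implementation; in MoE-VRD each expert would be a full visual relationship learner rather than a small MLP.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts layer (assumption:
# simple MLP experts and top-k routing; illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is an independent small network; in MoE-VRD each expert
        # would be a visual relationship learner producing ⟨subject, predicate,
        # object⟩ predictions.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # learned gating network

    def forward(self, x):
        # Route each input to its top-k experts only (conditional computation):
        # experts that are not selected are never evaluated for that input.
        logits = self.gate(x)                               # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)   # sparse selection
        weights = F.softmax(topk_vals, dim=-1)               # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    # Weighted aggregation of the selected experts' outputs.
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out
```

Because only k experts run per input, capacity grows with the number of experts while per-example computational cost stays roughly constant, which is the property the abstract appeals to.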
Keywords