Sensors (May 2025)
Product Engagement Detection Using Multi-Camera 3D Skeleton Reconstruction and Gaze Estimation
Abstract
Product engagement detection in retail environments is critical for understanding customer preferences through nonverbal cues such as gaze and hand movements. This study presents a system that combines a 360-degree top-view fisheye camera with two perspective cameras. These are the only sensors required for deployment, and together they capture subtle interactions even under occlusion or when the cameras are far from the customer. Conventional image-based gaze estimation methods are sensitive to background variations and require capturing a person's full appearance, which raises privacy concerns. In contrast, our approach uses a novel Transformer-based encoder that operates directly on 3D skeletal keypoints. Because it avoids personal appearance data, this design significantly reduces privacy risks, and it benefits from ongoing advances in accurate skeleton estimation. Experimental evaluation in a simulated retail environment demonstrates that our method effectively identifies critical gaze-object and hand-object interactions and reliably detects customer engagement before product selection. Although the Transformer-based model yields slightly higher mean angular errors in gaze estimation than a recent image-based method, it achieves comparable performance in gaze-object detection. Its robustness, generalizability, and inherent privacy preservation make it particularly suitable for deployment in practical retail settings such as convenience stores, supermarkets, and shopping malls, highlighting its advantages for real-world use.
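The abstract compares gaze estimators by mean angular error. The paper's evaluation code is not reproduced here. The sketch below assumes the common definition: for each sample, the angle between the predicted and ground-truth 3D gaze direction vectors, averaged over all samples. The function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical, and the inputs need not be unit vectors because they are normalized internally.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def angular_error_deg(pred: Vec3, gt: Vec3) -> float:
    """Angle in degrees between a predicted and a ground-truth 3D gaze direction.

    Hypothetical helper: assumes the standard arccos-of-cosine-similarity metric.
    """
    dot = sum(p * g for p, g in zip(pred, gt))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_g = math.sqrt(sum(g * g for g in gt))
    if norm_p == 0.0 or norm_g == 0.0:
        raise ValueError("gaze direction vectors must be non-zero")
    # Clamp to [-1, 1] so floating-point round-off cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / (norm_p * norm_g)))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds: Sequence[Vec3], gts: Sequence[Vec3]) -> float:
    """Average per-sample angular error (degrees) over paired predictions and labels."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and the same length")
    total = sum(angular_error_deg(p, g) for p, g in zip(preds, gts))
    return total / len(preds)


if __name__ == "__main__":
    # One exact prediction (0 degrees) and one orthogonal prediction (90 degrees).
    preds = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    gts = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(mean_angular_error_deg(preds, gts))
```

Running the usage example prints the mean of 0 and 90 degrees, that is, 45.0. Under this metric, a "slightly higher mean angular error" means the predicted gaze rays deviate from the true rays by a few more degrees on average. That deviation need not change which product a ray hits, which is consistent with the comparable gaze-object detection result reported above.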
Keywords