IEEE Access (Jan 2024)
Aggregation-Based Perceptual Deep Model for Scenic Image Recognition
Abstract
In the artificial intelligence (AI) community, accurately interpreting the semantics of complex scenes is a critical component of many systems. This paper introduces an effective pipeline that intelligently merges multi-channel perceptual visual features to recognize scenic images with intricate spatial layouts. Our focus is on developing a deep hierarchical model that proactively identifies where human gaze is directed within a scene. Our method comprises three key modules. First, we employ the BING objectness descriptor to swiftly and precisely localize objects or their parts across multiple scales within a scene. In parallel, a local-global feature fusion algorithm is formulated to represent each BING patch by integrating multiple low-level attributes from different channels. Second, to mimic how humans identify semantically or visually significant patches within a scene, we employ an active learning algorithm to localize salient scenic patches; these patches constitute the so-called Gaze Shift Path (GSP). Finally, an aggregation-guided deep neural network is designed to compute the deep GSP features, which are subsequently fed to a multi-label SVM to distinguish among scenic categories. Empirical evaluations reveal that our method's categorization accuracy outperforms existing models on six generic scenic datasets by $2\%\sim4.5\%$. Moreover, repeated experiments show that our method exhibits higher stability. Furthermore, our method demonstrates exceptional discriminative power on a specially compiled collection of sports educational images, where its accuracy exceeds that of the second-best performer by 8%. These results demonstrate the strong potential of computationally modeling human gaze behavior in various visual recognition tasks.
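To make the three-module structure of the pipeline concrete, the Python sketch below mirrors its data flow: BING-style patch proposals, local-global feature fusion, salient-patch selection into a GSP, deep aggregation, and a multi-label SVM. Every function body, feature dimension, and the norm-based saliency proxy are illustrative assumptions, not the authors' implementation; in particular, the BING detector, the active learning step, and the aggregation network are replaced by simple stand-ins.

```python
# Illustrative sketch of the pipeline's data flow; all names, shapes, and
# the saliency heuristic are hypothetical stand-ins for the paper's modules.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(42)

def bing_patches(image, num_proposals=64, dim=128):
    """Stand-in for the BING objectness detector: one feature row per
    multi-scale object(-part) proposal."""
    return rng.standard_normal((num_proposals, dim))

def fuse_local_global(patch_feats, global_feat):
    """Local-global fusion: pair each patch's low-level features with an
    image-level descriptor (concatenation is one simple fusion choice)."""
    tiled = np.tile(global_feat, (patch_feats.shape[0], 1))
    return np.hstack([patch_feats, tiled])

def select_gsp(fused, path_len=5):
    """Proxy for the active-learning step: keep the `path_len` most
    'salient' patches (largest feature norm) as the Gaze Shift Path."""
    order = np.argsort(-np.linalg.norm(fused, axis=1))
    return fused[order[:path_len]]

def aggregate_deep_gsp(gsp):
    """Proxy for the aggregation-guided deep network: pool the GSP patch
    features into a single image-level deep GSP descriptor."""
    return gsp.mean(axis=0)

def image_descriptor(image):
    patches = bing_patches(image)
    fused = fuse_local_global(patches, global_feat=patches.mean(axis=0))
    return aggregate_deep_gsp(select_gsp(fused))

# Train a multi-label SVM on the deep GSP descriptors (toy data).
X = np.stack([image_descriptor(None) for _ in range(20)])
Y = rng.integers(0, 2, size=(20, 3))  # 3 hypothetical scene labels
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(clf.predict(X[:2]))
```

In this sketch the multi-label classifier follows the common one-vs-rest reduction over a binary label matrix, which is one standard way to realize the "multi-label SVM" named in the abstract.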
Keywords