Applied Sciences (Aug 2025)
CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating
Abstract
During humanitarian crises, social media generates over 30 million multimodal tweets daily, yet 20% textual noise, 40% cross-modal misalignment, and severe class imbalance (rare classes account for only 4.1% of samples) hinder effective classification. This study presents CLIP-BCA-Gated, a dynamic multimodal framework that integrates bidirectional cross-attention (Bi-Cross-Attention) and adaptive gating within the CLIP architecture to address these challenges. The Bi-Cross-Attention module enables fine-grained cross-modal semantic alignment, while the adaptive gating mechanism dynamically weights the two modalities to suppress noise. Hierarchical learning-rate scheduling and multidimensional data augmentation further optimize feature fusion for real-time multiclass classification. On the CrisisMMD benchmark, CLIP-BCA-Gated achieves 91.77% classification accuracy, 1.55% higher than baseline CLIP and 2.33% higher than the state-of-the-art ALIGN, with exceptional recall on critical categories: infrastructure damage (93.42%) and rescue efforts (92.15%). The model processes tweets at 0.083 s per instance, meeting real-time deployment requirements for emergency response systems. Ablation studies show that Bi-Cross-Attention contributes a 2.54% accuracy improvement and adaptive gating a further 1.12%. This work demonstrates that dynamic multimodal fusion enhances resilience to noisy social media data, directly supporting SDG 11 through scalable, real-time disaster information triage. The framework’s noise-robust design and sub-second inference make it a practical solution for humanitarian organizations requiring rapid crisis categorization.
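To make the fusion design concrete, the sketch below shows one way a Bi-Cross-Attention block with an adaptive gate could be wired on top of CLIP token features. It is a minimal illustration under assumptions: the module name BiCrossAttentionGate, the feature dimension, the number of heads, mean pooling, and the scalar per-sample gate are all hypothetical choices, not the paper's actual implementation.

# Hypothetical sketch: bidirectional cross-attention over CLIP text/image
# tokens, followed by an adaptive gate that reweights the two modalities.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class BiCrossAttentionGate(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # text queries attend to image keys/values, and vice versa
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # adaptive gate: per-sample scalar weight in [0, 1] for each modality
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, txt: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # txt: (B, Lt, D) text token features; img: (B, Li, D) image patches
        t_aligned, _ = self.txt2img(txt, img, img)   # text enriched by image
        i_aligned, _ = self.img2txt(img, txt, txt)   # image enriched by text
        t_vec = t_aligned.mean(dim=1)                # pool tokens to (B, D)
        i_vec = i_aligned.mean(dim=1)
        g = self.gate(torch.cat([t_vec, i_vec], dim=-1))  # (B, 1)
        # the gate can down-weight the noisier modality per instance
        return g * t_vec + (1 - g) * i_vec


# Usage: the fused vector would feed a linear classifier over crisis classes.
fusion = BiCrossAttentionGate(dim=512)
txt_feats = torch.randn(4, 32, 512)   # e.g., CLIP text token embeddings
img_feats = torch.randn(4, 49, 512)   # e.g., CLIP ViT patch embeddings
fused = fusion(txt_feats, img_feats)  # (4, 512)

A per-sample sigmoid gate of this kind is one plausible reading of "adaptive gating": when one modality is noisy (e.g., garbled tweet text), the learned weight shifts the fused representation toward the cleaner modality.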
Keywords