Applied Sciences (Aug 2025)
CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating
Abstract
During humanitarian crises, social media generates over 30 million multimodal tweets daily, yet 20% textual noise, 40% cross-modal misalignment, and severe class imbalance (rare classes account for only 4.1% of samples) hinder effective classification. This study presents CLIP-BCA-Gated, a dynamic multimodal framework that integrates bidirectional cross-attention (Bi-Cross-Attention) and adaptive gating within the CLIP architecture to address these challenges. The Bi-Cross-Attention module enables fine-grained cross-modal semantic alignment, while the adaptive gating mechanism dynamically weights the two modalities to suppress noise. Hierarchical learning-rate scheduling and multidimensional data augmentation further optimize feature fusion for real-time multiclass classification. On the CrisisMMD benchmark, CLIP-BCA-Gated achieves 91.77% classification accuracy, 1.55% higher than baseline CLIP and 2.33% higher than the state-of-the-art ALIGN, with exceptional recall on critical categories: infrastructure damage (93.42%) and rescue efforts (92.15%). The model processes tweets at 0.083 s per instance, meeting real-time deployment requirements for emergency response systems. Ablation studies show that Bi-Cross-Attention contributes a 2.54% accuracy improvement and adaptive gating a further 1.12%. This work demonstrates that dynamic multimodal fusion enhances resilience to noisy social media data, directly supporting SDG 11 through scalable, real-time disaster information triage. The framework’s noise-robust design and sub-second inference make it a practical solution for humanitarian organizations requiring rapid crisis categorization.
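To make the fusion design concrete, the sketch below shows one way a Bi-Cross-Attention block with an adaptive gate could be wired on top of CLIP token features. It is a minimal illustration under assumptions: the module name BiCrossAttentionGate, the feature dimension, the number of heads, mean pooling, and the scalar per-sample gate are all hypothetical choices, not the paper's actual implementation.

# Hypothetical sketch: bidirectional cross-attention over CLIP text/image
# tokens, followed by an adaptive gate that reweights the two modalities.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class BiCrossAttentionGate(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # text queries attend to image keys/values, and vice versa
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # adaptive gate: per-sample scalar weight in [0, 1] for each modality
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, txt: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # txt: (B, Lt, D) text token features; img: (B, Li, D) image patches
        t_aligned, _ = self.txt2img(txt, img, img)   # text enriched by image
        i_aligned, _ = self.img2txt(img, txt, txt)   # image enriched by text
        t_vec = t_aligned.mean(dim=1)                # pool tokens to (B, D)
        i_vec = i_aligned.mean(dim=1)
        g = self.gate(torch.cat([t_vec, i_vec], dim=-1))  # (B, 1)
        # the gate can down-weight the noisier modality per instance
        return g * t_vec + (1 - g) * i_vec


# Usage: the fused vector would feed a linear classifier over crisis classes.
fusion = BiCrossAttentionGate(dim=512)
txt_feats = torch.randn(4, 32, 512)   # e.g., CLIP text token embeddings
img_feats = torch.randn(4, 49, 512)   # e.g., CLIP ViT patch embeddings
fused = fusion(txt_feats, img_feats)  # (4, 512)

A per-sample sigmoid gate of this kind is one plausible reading of "adaptive gating": when one modality is noisy (e.g., garbled tweet text), the learned weight shifts the fused representation toward the cleaner modality.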
Keywords