IEEE Access (Jan 2024)
BoneCLIP-XGBoost: A Multimodal Approach for Bone Fracture Diagnosis
Abstract
The integration of visual and textual data in medical diagnostics holds significant potential for improving the accuracy and reliability of clinical decision-making. Existing multimodal models such as CLIP, ConVIRT, and MedCLIP have made strides in this direction by leveraging paired image and text data, but they face several challenges: the need for extensive labeled training data, difficulty in accurately aligning multimodal information, limited model interpretability, and limited robustness across diverse clinical settings. In response to these issues, we introduce BoneCLIP-XGBoost, a novel diagnostic model that combines the feature-extraction strengths of a Vision Transformer (ViT) and ClinicalBERT with the classification capabilities of XGBoost. By encoding X-ray images and their textual descriptions into a unified feature space, BoneCLIP-XGBoost achieves tighter alignment and integration of multimodal data. Our ablation study demonstrates the benefit of this design: when both image and text data are used, the model attains an accuracy of 88.5%, a precision of 87.3%, a recall of 86.8%, and an F1 score of 87.0%. These results underscore the effectiveness of our approach in addressing the limitations of existing methods, providing a more accurate and reliable solution for bone fracture diagnosis.
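For concreteness, the following is a minimal sketch of the pipeline summarized above. It assumes off-the-shelf ViT and Bio_ClinicalBERT checkpoints and simple concatenation as the fusion step; the authors' exact encoders, fusion strategy, and XGBoost hyperparameters may differ.

```python
# Illustrative sketch only: ViT image embedding + ClinicalBERT text embedding,
# fused by concatenation and classified with XGBoost. Checkpoint names and the
# concatenation-based fusion are assumptions, not the paper's exact configuration.
import numpy as np
import torch
from transformers import ViTImageProcessor, ViTModel, AutoTokenizer, AutoModel
from xgboost import XGBClassifier

# Pretrained encoders (checkpoints chosen for illustration)
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT").eval()

@torch.no_grad()
def encode(image, report: str) -> np.ndarray:
    """Encode one X-ray (PIL image) and its textual report into a fused feature vector."""
    img_inputs = image_processor(images=image, return_tensors="pt")
    img_emb = vit(**img_inputs).last_hidden_state[:, 0]      # ViT [CLS] embedding, (1, 768)
    txt_inputs = tokenizer(report, return_tensors="pt", truncation=True, max_length=128)
    txt_emb = bert(**txt_inputs).last_hidden_state[:, 0]     # ClinicalBERT [CLS] embedding, (1, 768)
    return torch.cat([img_emb, txt_emb], dim=-1).squeeze(0).numpy()  # fused 1536-d feature

# Training on fused features (train_images, train_reports, train_labels are placeholders)
# X_train = np.stack([encode(img, txt) for img, txt in zip(train_images, train_reports)])
# clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
# clf.fit(X_train, train_labels)
# prediction = clf.predict(encode(test_image, test_report)[None, :])
```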
Keywords