IEEE Access (Jan 2024)
MedTrans: Intelligent Computing for Medical Diagnosis Using Multiscale Cross-Attention Vision Transformer
Abstract
Vision transformers (ViTs) have outperformed convolutional neural networks (CNNs) on general image classification. Motivated by this, we explore the ViT for Oral Squamous Cell Carcinoma (OSCC) detection from histopathological images. Such medical image understanding requires information from multiple spatial resolutions. Therefore, we propose a multiscale transformer that processes image patch tokens of variable scales to extract fine-grained and coarse-grained features. Our transformer design is based on two branches, a small branch (i.e., small-sized patch tokens) and a large branch (i.e., large-sized patch tokens). Each branch is processed with a separate specialized encoder to represent local and global context information from multiscale image patch tokens, and multi-head cross-attention with lateral connections fuses information across scales. Our ablation shows that MedTrans consistently performs better as the patch size becomes smaller. A comprehensive comparison shows that our model performs better than different vision transformers and state-of-the-art CNN models on the OSCC dataset. For example, MedTrans-S outperforms the recently proposed CNN-Transformer model TransPath by +3.58% Top-1 accuracy and +3.91% F1-score, and the best-performing CNN model, EfficientNet, by +8.30% Top-1 accuracy and +7.23% F1-score.
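As an illustration of the two-branch idea, the following is a minimal NumPy sketch of tokenizing one image at two patch sizes and fusing the branches with multi-head cross-attention. It is not the MedTrans implementation: the function names (`patchify`, `cross_attention`), the 64x64 input, the patch sizes 8 and 16, the embedding width 32, the four heads, and the random weights are all assumptions chosen for readability, and the per-branch encoders and lateral connections are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, wq, wk, wv, wo, heads):
    """Multi-head cross-attention: `queries` tokens attend to `context` tokens."""
    n_q, d = queries.shape
    n_c = context.shape[0]
    dh = d // heads
    # Project, then split the channel dimension into heads: (heads, tokens, dh).
    q = (queries @ wq).reshape(n_q, heads, dh).transpose(1, 0, 2)
    k = (context @ wk).reshape(n_c, heads, dh).transpose(1, 0, 2)
    v = (context @ wv).reshape(n_c, heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n_q, n_c)
    out = (attn @ v).transpose(1, 0, 2).reshape(n_q, d)     # merge heads
    return queries + out @ wo, attn                         # residual connection

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64, 3))  # stand-in for a histopathology tile
dim, heads = 32, 4                        # hypothetical embedding width and head count

small = patchify(image, 8)    # fine-grained branch: 64 tokens
large = patchify(image, 16)   # coarse-grained branch: 16 tokens

# Random linear patch embeddings into a shared width (assumed for simplicity).
tok_s = small @ (rng.standard_normal((small.shape[1], dim)) * 0.02)
tok_l = large @ (rng.standard_normal((large.shape[1], dim)) * 0.02)

def rand_weights():
    return [rng.standard_normal((dim, dim)) * 0.1 for _ in range(4)]

# Fuse in both directions: each branch queries the other scale.
fused_s, attn_s = cross_attention(tok_s, tok_l, *rand_weights(), heads=heads)
fused_l, attn_l = cross_attention(tok_l, tok_s, *rand_weights(), heads=heads)

print(fused_s.shape, fused_l.shape)
```

Each branch keeps its own token count after fusion, so the small branch retains its finer spatial grid while gaining context from the large-patch tokens, and vice versa.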
Keywords