IEEE Access (Jan 2024)

MedTrans: Intelligent Computing for Medical Diagnosis Using Multiscale Cross-Attention Vision Transformer

  • Yang Xu,
  • Yuan Hong,
  • Xinchen Li,
  • Mu Hu

DOI
https://doi.org/10.1109/ACCESS.2024.3450121
Journal volume & issue
Vol. 12
pp. 146575–146586

Abstract

The vision transformer (ViT) has outperformed convolutional neural networks (CNNs) on general image classification. Motivated by this, we explore the ViT for oral squamous cell carcinoma (OSCC) detection from histopathological images. Such medical image understanding requires information from multiple spatial resolutions. To this end, we propose a multiscale transformer that processes image patch tokens of variable scales to extract both fine-grained and coarse-grained features. Our transformer design comprises two branches, a small branch (i.e., small-sized patch tokens) and a large branch (i.e., large-sized patch tokens), where each branch is processed by a separate, specialized encoder to represent local and global context from its patch tokens, and multi-head cross-attention with lateral connections fuses information across scales. Our ablation study shows that MedTrans performs consistently better as the patch size decreases. We present a comprehensive comparison showing that our model outperforms various vision transformers and state-of-the-art CNN models on the OSCC dataset. For example, MedTrans-S surpasses the recently proposed CNN-Transformer model TransPath by +3.58% Top-1 accuracy and +3.91% F1-score, and the best-performing CNN model, EfficientNet, by +8.30% Top-1 accuracy and +7.23% F1-score.
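The cross-scale fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a single attention head in plain NumPy, with arbitrary token counts, embedding size, and weight shapes chosen for illustration, where tokens from the small (fine-grained) branch query tokens from the large (coarse-grained) branch, and a residual (lateral) connection carries the original tokens forward.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """One cross-attention head: tokens of one branch (queries)
    attend to tokens of the other branch (keys/values)."""
    Q = q_tokens @ Wq
    K = kv_tokens @ Wk
    V = kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 16                                      # embedding dimension (illustrative)
small = rng.standard_normal((64, d))        # fine-grained tokens (small patches)
large = rng.standard_normal((16, d))        # coarse-grained tokens (large patches)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# Small-branch tokens gather coarse context from the large branch;
# the addition plays the role of a lateral (residual) connection.
fused = small + cross_attention(small, large, Wq, Wk, Wv)
print(fused.shape)  # (64, 16)
```

In the full model, a symmetric head would let large-branch tokens attend to the small branch, and the fused tokens would feed subsequent encoder layers.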

Keywords