ViT-SAPS: Detail-Aware Transformer for Mechanical Assembly Semantic Segmentation

Haitao Dong; Chengjun Chen; Jinlei Wang; Feixiang Shen; Yong Pang

doi:10.1109/ACCESS.2023.3270807

IEEE Access (Jan 2023)

Affiliations

Haitao Dong: School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China
Chengjun Chen: School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
Jinlei Wang: School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
Feixiang Shen: School of Mechanical and Automotive Engineering, Qingdao University of Technology, Qingdao, China
Yong Pang: School of Information and Control Engineering, Qingdao University of Technology, Qingdao, China

DOI: https://doi.org/10.1109/ACCESS.2023.3270807
Journal volume & issue: Vol. 11
pp. 41467 – 41479

Abstract

Semantic segmentation of mechanical assembly images provides an effective way to monitor the assembly process and improve product quality. Compared with other deep learning models, the Transformer has advantages in modeling global context, and it has been widely applied to computer vision tasks, including semantic segmentation. However, a Transformer attends to all regions of an image at the same granularity, which makes it difficult to apply to the semantic segmentation of mechanical assembly images, in which mechanical parts differ greatly in size and information is unevenly distributed. This paper proposes a novel Transformer-based model called Vision Transformer with Self-Adaptive Patch Size (ViT-SAPS). ViT-SAPS perceives the detail information in an image and pays finer-grained attention to the regions where that detail is located, thus meeting the requirements of mechanical assembly semantic segmentation. Specifically, a self-adaptive patch splitting algorithm is proposed to split an image into patches of various sizes: the more detail information an image region contains, the smaller the patches it is split into. Further, to handle these variable-size patches, a position encoding scheme and a non-uniform bilinear interpolation algorithm, applied after sequence decoding, are proposed. Experimental results show that ViT-SAPS has stronger detail segmentation ability than the model with a fixed patch size and achieves an impressive locality-globality trade-off. This study not only provides a practical method for mechanical assembly semantic segmentation but also offers value for the application of vision Transformers in other fields. The code is available at: https://github.com/QDLGARIM/ViT-SAPS.
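
To make the idea of detail-dependent patch sizes concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a quadtree-style recursive split of a square grayscale image and uses pixel standard deviation as a stand-in for "detail information"; the abstract does not specify the detail measure or the splitting rule, so the function name split_adaptive and the parameters min_size and threshold are hypothetical.

```python
import numpy as np


def split_adaptive(image, min_size=4, threshold=10.0):
    """Illustrative quadtree split (an assumption, not the paper's method).

    Returns a list of (row, col, size) square patches that tile the image.
    Regions with a higher pixel standard deviation are split further.
    """
    h, w = image.shape
    # The sketch only handles square images whose side is a power of two.
    assert h == w and h > 0 and (h & (h - 1)) == 0
    patches = []

    def recurse(r, c, size):
        region = image[r:r + size, c:c + size]
        # Standard deviation is a placeholder "detail" score (assumption).
        if size > min_size and region.std() > threshold:
            half = size // 2
            for dr in (0, half):
                for dc in (0, half):
                    recurse(r + dr, c + dc, half)
        else:
            patches.append((r, c, size))

    recurse(0, 0, h)
    return patches


# Usage: a flat 32x32 image with a pixel-level checkerboard in one corner.
img = np.zeros((32, 32), dtype=np.float64)
img[:8, :8] = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0
patches = split_adaptive(img)
for row, col, size in sorted(patches, key=lambda p: p[2]):
    print(f"patch at ({row:2d}, {col:2d}) with side {size}")
```

In this toy input, the smallest patches cluster in the textured corner while the flat background stays in large patches, which mirrors the behavior the abstract describes. The resulting variable-size patches are what motivate the dedicated position encoding and non-uniform interpolation mentioned above; the actual implementation is in the linked repository.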

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
