IEEE Access (Jan 2024)
Improved DETR With Class Tokens in an Encoder
Abstract
DETR was the first model to apply a transformer to object detection. By casting object detection as a set prediction problem, it removes the need for anchor boxes and non-maximum suppression. DETR has shown competitive results on public datasets and has inspired many new ideas in object detection. Most DETR-like methods focus on improving the decoder and its object queries. Based on prior research and on our own analysis of the outputs of the backbone and the encoder, we conclude that these two components of DETR and DETR-like models serve as feature extractors for object detection. We therefore reinforce the feature extraction stage by introducing class tokens in the encoder. Specifically, we add a class token module that represents prior category information in the encoder. The module exploits the global attention among feature tokens to provide prior knowledge during feature extraction. We investigate two initialization methods for the proposed class token module: random initialization and pretrained class tokens. The proposed module can also be used as a plug-and-play component in DETR-like models. Experimental results show that adding the proposed module improves performance over each baseline model.
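As an illustration only, the idea of class tokens sharing global attention with feature tokens can be sketched as follows. This is a minimal sketch under stated assumptions, not the module evaluated in this work: the function names (`init_class_tokens`, `encode_with_class_tokens`), the single attention head with identity projections, the choice to prepend the class tokens to the feature tokens, and the standard deviation of 0.02 for random initialization are all hypothetical simplifications.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head global self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real encoder layer learns these projections.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def init_class_tokens(num_classes, dim, pretrained=None, seed=0):
    """Return one token per category.

    Hypothetical helper covering the two initialization options:
    copy pretrained tokens if given, otherwise draw small random values.
    """
    if pretrained is not None:
        return [list(t) for t in pretrained]
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_classes)]


def encode_with_class_tokens(feature_tokens, class_tokens):
    """Attend jointly over class and feature tokens, then split them again.

    Assumption: class tokens are prepended to the feature tokens so that
    every feature token can attend to every class token and vice versa.
    """
    joint = class_tokens + feature_tokens
    attended = self_attention(joint)
    n_cls = len(class_tokens)
    return attended[:n_cls], attended[n_cls:]


# Usage: 3 categories, 5 feature tokens, embedding dimension 4.
class_tokens = init_class_tokens(num_classes=3, dim=4)
feature_tokens = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(5)]
new_class_tokens, new_feature_tokens = encode_with_class_tokens(
    feature_tokens, class_tokens)
```

In this sketch the updated feature tokens keep their original count and dimension, so a downstream decoder would see the same interface as before, which is one way to read the plug-and-play property described above.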
Keywords