Conv-Former: A Novel Network Combining Convolution and Self-Attention for Image Quality Assessment

Lintao Han; Hengyi Lv; Yuchen Zhao; Hailong Liu; Guoling Bi; Zhiyong Yin; Yuqiang Fang

doi:10.3390/s23010427

Sensors (Dec 2022)

Affiliations

Lintao Han: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
Hengyi Lv: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
Yuchen Zhao: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
Hailong Liu: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
Guoling Bi: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
Zhiyong Yin: Department of Electrical and Optical Engineering, Space Engineering University, Beijing 101416, China
Yuqiang Fang: Department of Electrical and Optical Engineering, Space Engineering University, Beijing 101416, China

DOI: https://doi.org/10.3390/s23010427
Journal volume & issue: Vol. 23, no. 1, p. 427

Abstract

To address the challenge of no-reference image quality assessment (NR-IQA) for authentically and synthetically distorted images, we propose a novel network, Conv-Former, which combines convolution and self-attention for image quality assessment (IQA). The model uses a multi-stage transformer architecture similar to that of ResNet-50 to represent the perceptual mechanisms relevant to IQA and thereby build an accurate IQA model. We employ an adaptive learnable position embedding to handle images of arbitrary resolution. We propose a new transformer block (TB) that pairs the transformer's ability to capture long-range dependencies with local information perception (LIP), which models local features, for enhanced representation learning. This block improves the model's understanding of image content. Dual path pooling (DPP) is used to retain more contextual image quality information during feature downsampling. Experimental results show that Conv-Former not only outperforms state-of-the-art methods on authentic image databases but also achieves competitive performance on synthetic image databases, demonstrating the strong fitting performance and generalization capability of the proposed model.
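
As a rough illustration of the dual path pooling idea mentioned in the abstract, the sketch below downsamples a small feature map along two parallel paths and merges the results. The abstract does not say what the two paths are or how they are merged, so the choice of max pooling and average pooling, the summation, and the name `dual_path_pool` are all assumptions made for this example, not the authors' implementation.

```python
def dual_path_pool(feature_map, window=2):
    """Downsample a 2D feature map along two parallel pooling paths.

    Assumption: the two paths are max pooling and average pooling, and
    their outputs are summed. The abstract specifies neither choice.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    # Slide a non-overlapping window x window patch over the map.
    for top in range(0, height - window + 1, window):
        row = []
        for left in range(0, width - window + 1, window):
            patch = [feature_map[top + dy][left + dx]
                     for dy in range(window) for dx in range(window)]
            max_path = max(patch)               # strongest local response
            avg_path = sum(patch) / len(patch)  # average local context
            row.append(max_path + avg_path)
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map, reduced to 2x2.
features = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 4.0, 0.0, 0.0],
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 1.0, 3.0],
]
pooled = dual_path_pool(features)
print(pooled)  # [[6.5, 5.0], [10.0, 4.5]]
```

In the top-right patch a single strong value (4.0) dominates the max path while the average path records that the surrounding region is mostly empty. That is the kind of contextual information a single pooling path would discard.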

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
