IEEE Access (Jan 2024)
Low Complexity In-Loop Filter for VVC Based on Convolution and Transformer
Abstract
The Joint Video Experts Team (JVET) has been exploring neural network-based video coding (NNVC) with the aim of introducing it into versatile video coding (VVC). Within NNVC, the NN-based in-loop filter is the most actively studied tool and the closest to software deployment. Recent NN-based in-loop filters have begun adopting Transformers to capture contextual information, but this raises complexity markedly, to roughly 1000 kMAC/pixel. In this paper, we propose a low-complexity NN-based in-loop filter for VVC that combines convolution and Transformer, named ConvTransNet. ConvTransNet adopts a pyramid structure for feature extraction, capturing both global contextual information and local details at multiple scales. Moreover, ConvTransNet combines a convolutional neural network (CNN) with a Transformer in the in-loop filter: the CNN captures local features and reduces compression artifacts, while the Transformer captures long-range spatial dependencies and enhances global structures, so the filter both removes artifacts and improves visual quality. To reduce network complexity, we use grouped convolutions in the CNN branch and depthwise convolutions in the Transformer branch. As a result, ConvTransNet captures both local spatial structure and global contextual information while achieving a favorable trade-off between coding gain (BD-rate) and complexity. Experimental results show that the proposed NN-based in-loop filter based on ConvTransNet achieves average {6.58%, 23.02%, 23.04%} and {8.18%, 22.67%, 22.00%} BD-rate reductions for the {Y, U, V} channels over the VTM_11.0-NNVC_2.0 anchor under the all-intra (AI) and random-access (RA) configurations, respectively.
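As a concrete illustration of the complexity-reduction techniques named above, the sketch below shows a minimal PyTorch-style hybrid block that pairs a grouped convolution (local feature path) with a depthwise-convolution-based channel attention stage (global context path). This is a hypothetical sketch of the general grouped/depthwise pattern only, not the paper's actual ConvTransNet architecture; the class name ConvTransBlock and the hyperparameters (channels=64, groups=4) are illustrative assumptions.

import torch
import torch.nn as nn

class ConvTransBlock(nn.Module):
    """Hypothetical hybrid block; NOT the paper's ConvTransNet.

    Illustrates the two complexity-reduction ideas from the abstract:
    grouped convolution in the CNN path and depthwise convolution in
    the Transformer (attention) path.
    """

    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        # CNN path: a grouped 3x3 convolution costs 1/groups of the MACs
        # of a dense 3x3 convolution with the same channel count
        # (9*C*C/groups vs. 9*C*C MACs per pixel).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
        )
        # Attention path: pointwise projection to Q, K, V followed by a
        # cheap depthwise 3x3 convolution (groups == channel count).
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, channels * 3, 1),
            nn.Conv2d(channels * 3, channels * 3, 3, padding=1,
                      groups=channels * 3),
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.local(x)                        # local features
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)        # each (b, c, h, w)
        q, k, v = (t.flatten(2) for t in (q, k, v))  # each (b, c, h*w)
        # Channel-wise ("transposed") attention: the c-by-c attention map
        # keeps the cost linear in the number of pixels, unlike the
        # quadratic cost of full spatial self-attention.
        attn = torch.softmax((q @ k.transpose(1, 2)) / (h * w) ** 0.5, dim=-1)
        out = (attn @ v).view(b, c, h, w)            # global context
        return x + self.proj(out)

# Usage: filter a batch of 64-channel feature maps.
x = torch.randn(1, 64, 32, 32)
y = ConvTransBlock(channels=64, groups=4)(x)
assert y.shape == x.shape

With C = 64, the grouped 3x3 convolution above needs 9 * 64 * 64 / 4 = 9,216 MACs per pixel versus 36,864 for its dense counterpart, which is the kind of per-layer saving that keeps a filter of this type well below the roughly 1000 kMAC/pixel cited for recent Transformer-based filters.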
Keywords