IEEE Access (Jan 2024)
A Bitrate-Scalable Variational Recurrent Mel-Spectrogram Coder for Real-Time Resynthesis-Based Speech Coding
Abstract
This paper introduces a method for real-time speech coding that combines a binary-latent-vector variational recurrent neural network for mel-spectrogram coding with a non-autoregressive convolutional vocoder for waveform reconstruction. To enable bitrate scalability, we propose a latent vector truncation and padding technique. We evaluate both fixed- and scalable-bitrate variants of the proposed method, comparing them to a baseline vector quantization-based coder. The method is also benchmarked against Opus, Lyra v2, EnCodec, and AudioDec using objective metrics and subjective ratings from a MUSHRA listening test. At 1.38 kbps, the proposed method significantly outperforms Lyra v2 at 3 kbps, and at 5.51 kbps it matches the performance of Lyra v2 at 6 kbps. Although AudioDec significantly surpasses the proposed method at 6.4 kbps on test data from the TSP speech dataset, the proposed method shows competitive or superior results on withheld speakers from the VCTK dataset. The results show that recurrent coding with binary latent vectors is a viable alternative to prevailing vector quantization-based approaches.
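As a rough illustration of the latent vector truncation and padding idea, the following minimal sketch shows how dropping trailing bits of a per-frame binary latent vector lowers the bitrate, and how the decoder side can restore the full dimension before decoding. The function names, the zero fill value, the 8-bit latent, and the frame rate of 100 frames per second are all hypothetical choices made for this example; the abstract does not specify them, and the actual method may differ.

```python
from typing import List


def truncate_latent(bits: List[int], n_keep: int) -> List[int]:
    """Encoder side: keep only the first n_keep bits of a binary latent vector."""
    if not 0 <= n_keep <= len(bits):
        raise ValueError("n_keep must be between 0 and len(bits)")
    return bits[:n_keep]


def pad_latent(bits: List[int], full_dim: int, fill: float = 0.0) -> List[float]:
    """Decoder side: pad a truncated vector back to full_dim.

    The fill value of 0.0 is an assumption made for illustration only.
    """
    if len(bits) > full_dim:
        raise ValueError("received more bits than the full latent dimension")
    return [float(b) for b in bits] + [fill] * (full_dim - len(bits))


def bitrate_bps(bits_per_frame: int, frames_per_second: float) -> float:
    """Bitrate in bits per second for a fixed number of bits per frame."""
    return bits_per_frame * frames_per_second


# Hypothetical 8-bit latent for one frame, truncated to 4 bits for transmission.
latent = [1, 0, 1, 1, 0, 0, 1, 0]
sent = truncate_latent(latent, 4)
restored = pad_latent(sent, full_dim=len(latent))

# Hypothetical frame rate; halving the bits per frame halves the bitrate.
full_rate = bitrate_bps(len(latent), 100.0)
reduced_rate = bitrate_bps(len(sent), 100.0)
```

In this sketch the bitrate scales linearly with the number of transmitted bits per frame, so a single model could in principle serve several operating points by varying `n_keep`.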
Keywords