Безопасность информационных технологий (Dec 2021)
Spectrogram image encoding to provide variable audio data rates and preserve its sound quality
Abstract
Applications such as audio monitoring and recording under information-technical counteraction, noise removal, digital watermarking, audio fingerprinting, and protective text audio markers require a compact representation of speech signals for subsequent transmission and storage. The representation must preserve, as far as possible, the similarity of the restored speech to the original in sound quality while eliminating accompanying interference. The proposed audio codec is based on the narrow-band sine-Gaussian model of speech analysis/synthesis, in which speech is represented as a superposition of harmonic components weighted by a Gaussian window, and this representation applies to all types of speech frames. The codec also relies on general-purpose and special methods for constructing and processing images of narrow-band dynamic spectrograms, in particular on applying compression-recovery algorithms to them. This makes it possible to regulate the speech bitrate over a wide range of 1.2–16 kbit/s and to adapt to changes in the bandwidth of the audio transmission-storage channel, whether caused by objective factors or by the actions of an intruder. This work aims to select the best parameters of the spectrogram images: those that reduce the overall bitrate, remove the influence of noise and interference, and allow spectral inversion methods and algorithms to recover the speech signal with the same or better quality. The parameters are extracted from spectrogram images obtained with the short-time Fourier transform, using methods that extract the amplitudes, frequencies, phases, and evolution tracks of selected local or global maxima (peaks) of the speech signal on the spectral slices.
The communication channel can carry either the parameters themselves or the compressed and encoded spectrogram image. In the latter case, the receiver restores the image of the original spectrogram and then either selects the peak parameters from it for subsequent speech synthesis or applies direct spectral inversion of the image into speech. The reconstructed spectrogram can be corrected using a priori information about the speaker's speech taken from a pre-generated voice database.
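As a rough illustration of the peak extraction step described in the abstract, the following Python sketch computes Gaussian-windowed spectral slices and keeps the strongest local maxima of each slice as (frequency, amplitude, phase) triples. It is a minimal sketch under stated assumptions, not the authors' codec: the function names, the window width (`sigma_ratio`), the frame length, the hop size, and the number of peaks kept are all hypothetical values chosen for illustration. A naive DFT from the standard library is used to keep the example self-contained; a practical implementation would use an FFT. Linking peaks across frames into tracks is not covered here.

```python
import cmath
import math


def gaussian_window(n, sigma_ratio=0.15):
    """Gaussian window of length n; sigma = sigma_ratio * n (assumed value)."""
    center = (n - 1) / 2.0
    sigma = sigma_ratio * n
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]


def spectral_slice(frame, window):
    """Naive one-sided DFT (bins 0..n/2) of one windowed frame."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, window)]
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def pick_peaks(spectrum, sample_rate, n_fft, max_peaks=4):
    """Return up to max_peaks local maxima as (freq_hz, amplitude, phase)."""
    mags = [abs(c) for c in spectrum]
    peaks = []
    for k in range(1, len(mags) - 1):
        # A bin is a peak if it exceeds its left neighbour and is not
        # smaller than its right neighbour.
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            freq_hz = k * sample_rate / n_fft
            peaks.append((freq_hz, mags[k], cmath.phase(spectrum[k])))
    # Keep only the strongest peaks of this slice.
    peaks.sort(key=lambda p: p[1], reverse=True)
    return peaks[:max_peaks]


def analyze(signal, sample_rate, frame_len=256, hop=128, max_peaks=4):
    """Split the signal into overlapping frames and pick peaks per frame."""
    window = gaussian_window(frame_len)
    slices = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = spectral_slice(frame, window)
        slices.append(pick_peaks(spectrum, sample_rate, frame_len, max_peaks))
    return slices
```

Calling `analyze(samples, 8000)` on a list of samples returns one list of peak triples per frame. Transmitting only these triples instead of the full spectrogram is the kind of trade-off between bitrate and fidelity that the abstract discusses; the number of peaks kept per slice is one obvious knob for that trade-off.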
Keywords