IEEE Open Journal of Signal Processing (Jan 2024)

Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement

  • Katerina Zmolikova
  • Michael Syskind Pedersen
  • Jesper Jensen

DOI
https://doi.org/10.1109/OJSP.2023.3343343
Journal volume & issue
Vol. 5
pp. 274–283

Abstract

Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e., situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data, however, is much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent both the speech and noise components of the noisy signals well. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e., RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method.
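The core self-supervised pretext task described above, hiding regions of a noisy spectrogram and training a network to reconstruct them, can be sketched in a few lines. This is a minimal illustration only, assuming rectangular time-frequency patch masking and a loss computed over the hidden bins; the function names, patch shape, and mask ratio are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def mask_spectrogram(spec, mask_ratio=0.3, patch=(8, 8), rng=None):
    """Zero out random rectangular patches of a (freq, time) spectrogram.

    Returns the masked spectrogram and a boolean mask marking hidden bins.
    Patch-based masking is an assumption for illustration; the paper's
    masking scheme may differ.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    f, t = spec.shape
    mask = np.zeros((f, t), dtype=bool)
    # Draw enough patches to cover roughly mask_ratio of the bins.
    n_patches = int(mask_ratio * f * t / (patch[0] * patch[1]))
    for _ in range(n_patches):
        i = rng.integers(0, max(1, f - patch[0] + 1))
        j = rng.integers(0, max(1, t - patch[1] + 1))
        mask[i:i + patch[0], j:j + patch[1]] = True
    masked = np.where(mask, 0.0, spec)
    return masked, mask

def masked_prediction_loss(pred, target, mask):
    """MSE over the masked (hidden) bins only, as in masked prediction."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

In a pretraining loop, a network would receive the masked noisy spectrogram and be trained to minimize `masked_prediction_loss` against the unmasked input; no clean targets are needed, which is what makes the task usable on in-domain noisy-only data.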

Keywords