IEEE Open Journal of Signal Processing (Jan 2024)

Masked Spectrogram Prediction for Unsupervised Domain Adaptation in Speech Enhancement

  • Katerina Zmolikova
  • Michael Syskind Pedersen
  • Jesper Jensen

DOI
https://doi.org/10.1109/OJSP.2023.3343343
Journal volume & issue
Vol. 5
pp. 274–283

Abstract

Supervised learning-based speech enhancement methods often work remarkably well in acoustic situations represented in the training corpus but generalize poorly to out-of-domain situations, i.e., situations not seen during training. This stands in the way of further improvement of these methods in realistic scenarios, as collecting paired noisy-clean recordings in the target application domain is typically not feasible. Recording noisy-only in-domain data, however, is much more practical. In this article, we tackle the problem of unsupervised domain adaptation in speech enhancement. Specifically, we propose a way to use in-domain noisy-only data in the training of a neural network to improve upon a model trained solely on out-of-domain paired data. For this, we make use of masked spectrogram prediction, a technique from self-supervised learning that aims to interpolate masked regions of a spectrogram. We hypothesize that masked spectrogram prediction encourages learning of features that represent both the speech and noise components of the noisy signals well. These features can then be used to train a more robust speech enhancement system. We evaluate the proposed method on the VoiceBank-DEMAND and LibriFSD50k databases, with WSJ0-CHiME3 serving as the out-of-domain database. We show that the proposed method outperforms both the out-of-domain system and the baseline approaches, i.e., RemixIT and noisy-target training, and also combines well with the previously proposed RemixIT method.
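The core self-supervised pretext task described above, hiding regions of a noisy spectrogram and training a network to reconstruct them, can be sketched in a few lines. This is a minimal illustration only, assuming rectangular time-frequency patch masking and a loss computed over the hidden bins; the function names, patch shape, and mask ratio are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def mask_spectrogram(spec, mask_ratio=0.3, patch=(8, 8), rng=None):
    """Zero out random rectangular patches of a (freq, time) spectrogram.

    Returns the masked spectrogram and a boolean mask marking hidden bins.
    Patch-based masking is an assumption for illustration; the paper's
    masking scheme may differ.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    f, t = spec.shape
    mask = np.zeros((f, t), dtype=bool)
    # Draw enough patches to cover roughly mask_ratio of the bins.
    n_patches = int(mask_ratio * f * t / (patch[0] * patch[1]))
    for _ in range(n_patches):
        i = rng.integers(0, max(1, f - patch[0] + 1))
        j = rng.integers(0, max(1, t - patch[1] + 1))
        mask[i:i + patch[0], j:j + patch[1]] = True
    masked = np.where(mask, 0.0, spec)
    return masked, mask

def masked_prediction_loss(pred, target, mask):
    """MSE over the masked (hidden) bins only, as in masked prediction."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

In a pretraining loop, a network would receive the masked noisy spectrogram and be trained to minimize `masked_prediction_loss` against the unmasked input; no clean targets are needed, which is what makes the task usable on in-domain noisy-only data.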

Keywords