SelfRemaster: Self-Supervised Speech Restoration for Historical Audio Resources

Takaaki Saeki; Shinnosuke Takamichi; Tomohiko Nakamura; Naoko Tanji; Hiroshi Saruwatari

doi:10.1109/ACCESS.2023.3345027

IEEE Access (Jan 2023)

Affiliations

Takaaki Saeki: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Shinnosuke Takamichi: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Tomohiko Nakamura: National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
Naoko Tanji: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Hiroshi Saruwatari: Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan

DOI: https://doi.org/10.1109/ACCESS.2023.3345027
Journal volume & issue: Vol. 11, pp. 144831–144843

Abstract

Restoring high-quality speech from degraded historical recordings is crucial for the preservation of cultural and endangered linguistic resources. A key challenge in this task is the scarcity of paired training data that replicate the original acoustic conditions of the historical audio. Previous approaches have used pseudo-paired data generated by applying various distortions to clean speech corpora, but they are limited by their inability to authentically simulate the acoustic variations in historical recordings. We propose a self-supervised approach to speech restoration that does not require paired corpora. Our model has three main modules (analysis, synthesis, and channel), all of which are designed to emulate the recording process of degraded audio signals. The analysis module disentangles undistorted speech and distortion features, and the synthesis module generates the restored speech waveform. The channel module then introduces distortions into the restored speech waveform so that a reconstruction loss can be computed between the input and output degraded speech signals. We further improve our model by introducing several methods, including dual learning and semi-supervised learning. An additional feature of our model is audio effect transfer, which allows acoustic distortions from degraded audio signals to be applied to arbitrary audio signals. Experimental evaluations demonstrated that our approach significantly outperforms the previous supervised approach for the restoration of real historical speech resources.
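
The data flow described in the abstract (analysis, then synthesis, then channel, with a reconstruction loss on the degraded signal) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the frame length, feature sizes, random untrained weights, the FIR-filter channel, and every function name below are assumptions chosen only to make the pipeline concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FRAME = 64        # samples per frame
SPEECH_DIM = 16   # size of the per-frame speech feature
TAPS = 5          # length of the toy FIR "channel"

# Random, untrained toy weights; a real system would learn these.
W_speech = rng.standard_normal((FRAME, SPEECH_DIM)) * 0.1
W_dist = rng.standard_normal((FRAME, TAPS)) * 0.1
W_synth = rng.standard_normal((SPEECH_DIM, FRAME)) * 0.1


def analysis(degraded):
    """Split a degraded waveform into speech and distortion features."""
    frames = degraded.reshape(-1, FRAME)
    speech_feats = np.tanh(frames @ W_speech)           # one vector per frame
    distortion_feats = (frames @ W_dist).mean(axis=0)   # one vector per utterance
    return speech_feats, distortion_feats


def synthesis(speech_feats):
    """Generate a (toy) restored waveform from the speech features."""
    return (speech_feats @ W_synth).reshape(-1)


def channel(waveform, distortion_feats):
    """Re-apply distortion; here an FIR filter whose taps are the features."""
    return np.convolve(waveform, distortion_feats, mode="same")


def reconstruction_loss(degraded):
    """Mean squared error between the input and the re-degraded output."""
    speech_feats, distortion_feats = analysis(degraded)
    restored = synthesis(speech_feats)
    redegraded = channel(restored, distortion_feats)
    loss = float(np.mean((degraded - redegraded) ** 2))
    return loss, restored


def transfer_effect(source_degraded, target):
    """Apply the distortion estimated from one signal to another signal."""
    _, distortion_feats = analysis(source_degraded)
    return channel(target, distortion_feats)


degraded = rng.standard_normal(FRAME * 10)   # stand-in for a historical recording
loss, restored = reconstruction_loss(degraded)
clean = rng.standard_normal(FRAME * 4)       # stand-in for an arbitrary clean signal
aged = transfer_effect(degraded, clean)
```

Because the loss compares the input with its re-degraded reconstruction, no clean reference is needed, which is the sense in which such a setup is self-supervised. The `transfer_effect` helper mirrors the audio effect transfer idea by reusing only the distortion features of one signal on another.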

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
