Transactions of the Association for Computational Linguistics (Jan 2021)

On Generative Spoken Language Modeling from Raw Audio

  • Kushal Lakhotia,
  • Eugene Kharitonov,
  • Wei-Ning Hsu,
  • Yossi Adi,
  • Adam Polyak,
  • Benjamin Bolte,
  • Tu-Anh Nguyen,
  • Jade Copet,
  • Alexei Baevski,
  • Abdelrahman Mohamed,
  • Emmanuel Dupoux

DOI: https://doi.org/10.1162/tacl_a_00430
Journal volume & issue: Vol. 9, pp. 1336–1354

Abstract

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), along with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels, for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and we validate the proposed metrics with human evaluation. Across three speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.
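To make the pipeline described in the abstract concrete, the following is a minimal sketch of its first stage, encoding raw speech into discrete pseudo-text units. It assumes torchaudio's pretrained HUBERT_BASE bundle and scikit-learn's k-means as stand-ins for the paper's encoder and quantizer; the layer index, the 100-unit vocabulary, and the file names are illustrative assumptions, not the authors' released configuration.

import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

def extract_features(wav_path, layer=6):
    """Return frame-level HuBERT features (frames x dim) for one utterance."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = waveform.mean(0, keepdim=True)  # downmix to mono
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = hubert.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)  # features from the requested layer

# Fit the quantizer on features pooled over a (toy, hypothetical) training set.
train_feats = torch.cat([extract_features(p) for p in ["utt1.wav", "utt2.wav"]])
kmeans = KMeans(n_clusters=100, n_init=10).fit(train_feats.numpy())

def encode_to_units(wav_path):
    """Encode an utterance into a deduplicated sequence of discrete units."""
    units = kmeans.predict(extract_features(wav_path).numpy()).tolist()
    # Collapse runs of identical units before language modeling.
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

In the full pipeline, a generative language model is then trained on these deduplicated unit sequences, and a separate speech decoder maps sampled units back to a waveform.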