IEEE Open Journal of Signal Processing (Jan 2024)

L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality

  • Riccardo F. Gramaccioni
  • Christian Marinoni
  • Changan Chen
  • Aurelio Uncini
  • Danilo Comminiello

DOI
https://doi.org/10.1109/OJSP.2024.3376297
Journal volume & issue
Vol. 5
pp. 632–640

Abstract

The primary goal of the L3DAS (Learning 3D Audio Sources) project is to stimulate and support collaborative research on machine learning techniques applied to 3D audio signal processing. To this end, the L3DAS23 Challenge, presented at IEEE ICASSP 2023, focuses on two spatial audio tasks of paramount practical interest: 3D speech enhancement (3DSE) and 3D sound event localization and detection (3DSELD). Both tasks are evaluated within augmented reality applications. The aim of this paper is to describe the main results of the challenge. We provide the L3DAS23 dataset, a collection of first-order Ambisonics recordings in reverberant simulated environments. We retain some general characteristics of the previous L3DAS challenges, featuring a pair of first-order Ambisonics microphones to capture the audio signals and involving multiple-source, multiple-perspective Ambisonics recordings. In this new edition, however, we introduce audio-visual scenarios by including images that depict the frontal view of each environment as captured from the perspective of the microphones. This addition enriches the challenge, giving participants tools to explore combinations of audio and images for solving the 3DSE and 3DSELD tasks. In addition to the brand-new dataset, we provide updated baseline models designed to take advantage of audio-image pairs. To ensure accessibility and reproducibility, we also supply a supporting API for effortless replication of our results. Lastly, we present the results achieved by the participants of the L3DAS23 Challenge.
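To make the audio-visual pairing described in the abstract concrete, the following minimal Python sketch loads a four-channel first-order Ambisonics (FOA) recording together with the image depicting the frontal view of the environment. The file paths and the use of the soundfile and Pillow libraries are assumptions for illustration only; this is not the challenge's official API, which the paper supplies separately.

```python
# Minimal sketch (hypothetical file layout, not the official L3DAS23 API):
# load one first-order Ambisonics recording and its paired frontal-view image.
import soundfile as sf
from PIL import Image

audio_path = "L3DAS23/audio/sample.wav"   # hypothetical path
image_path = "L3DAS23/images/sample.png"  # hypothetical path

# soundfile returns (frames, channels); an FOA recording carries 4 channels
# (W, Y, Z, X in the common ACN channel ordering).
audio, sr = sf.read(audio_path)
assert audio.shape[1] == 4, "expected a 4-channel FOA recording"

# The paired image captures the scene from the microphone's perspective.
image = Image.open(image_path).convert("RGB")

print(f"FOA audio: {audio.shape[0] / sr:.1f} s at {sr} Hz; image size: {image.size}")
```

A multimodal baseline of the kind described above would consume both modalities jointly, e.g., feeding the FOA channels to an audio encoder and the frontal-view image to a visual encoder before fusing their features.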

Keywords