IEEE Open Journal of Signal Processing (Jan 2024)

L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality

  • Riccardo F. Gramaccioni
  • Christian Marinoni
  • Changan Chen
  • Aurelio Uncini
  • Danilo Comminiello

DOI
https://doi.org/10.1109/OJSP.2024.3376297
Journal volume & issue
Vol. 5
pp. 632–640

Abstract

The primary goal of the L3DAS (Learning 3D Audio Sources) project is to stimulate and support collaborative research on machine learning techniques applied to 3D audio signal processing. To this end, the L3DAS23 Challenge, presented at IEEE ICASSP 2023, focuses on two spatial audio tasks of paramount practical interest: 3D speech enhancement (3DSE) and 3D sound event localization and detection (3DSELD). Both tasks are evaluated within augmented reality applications. The aim of this paper is to describe the main results of the challenge. We provide the L3DAS23 dataset, a collection of first-order Ambisonics recordings in reverberant simulated environments. We retain some general characteristics of the previous L3DAS challenges, featuring a pair of first-order Ambisonics microphones to capture the audio signals and involving multiple-source, multiple-perspective Ambisonics recordings. In this new edition, however, we introduce audio-visual scenarios by including images that depict the frontal view of each environment as captured from the perspective of the microphones. This addition enriches the challenge, giving participants tools to explore combinations of audio and images for solving the 3DSE and 3DSELD tasks. In addition to the brand-new dataset, we provide updated baseline models designed to take advantage of audio-image pairs. To ensure accessibility and reproducibility, we also supply a supporting API for effortless replication of our results. Lastly, we present the results achieved by the participants of the L3DAS23 Challenge.
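To make the audio-visual pairing described in the abstract concrete, the following minimal Python sketch loads a four-channel first-order Ambisonics (FOA) recording together with the image depicting the frontal view of the environment. The file paths and the use of the soundfile and Pillow libraries are assumptions for illustration only; this is not the challenge's official API, which the paper supplies separately.

```python
# Minimal sketch (hypothetical file layout, not the official L3DAS23 API):
# load one first-order Ambisonics recording and its paired frontal-view image.
import soundfile as sf
from PIL import Image

audio_path = "L3DAS23/audio/sample.wav"   # hypothetical path
image_path = "L3DAS23/images/sample.png"  # hypothetical path

# soundfile returns (frames, channels); an FOA recording carries 4 channels
# (W, Y, Z, X in the common ACN channel ordering).
audio, sr = sf.read(audio_path)
assert audio.shape[1] == 4, "expected a 4-channel FOA recording"

# The paired image captures the scene from the microphone's perspective.
image = Image.open(image_path).convert("RGB")

print(f"FOA audio: {audio.shape[0] / sr:.1f} s at {sr} Hz; image size: {image.size}")
```

A multimodal baseline of the kind described above would consume both modalities jointly, e.g., feeding the FOA channels to an audio encoder and the frontal-view image to a visual encoder before fusing their features.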

Keywords