Automated audio captioning: an overview of recent progress and new challenges

Xinhao Mei; Xubo Liu; Mark D. Plumbley; Wenwu Wang

doi:10.1186/s13636-022-00259-2

EURASIP Journal on Audio, Speech, and Music Processing (Oct 2022)

Affiliations

Xinhao Mei: Centre for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey
Xubo Liu: Centre for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey
Mark D. Plumbley: Centre for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey
Wenwu Wang: Centre for Vision Speech and Signal Processing (CVSSP), Department of Electrical and Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey

Journal volume & issue: Vol. 2022, no. 1
pp. 1 – 18

Abstract

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
