Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

Rajat Hebbar; Pavlos Papadopoulos; Ramon Reyes; Alexander F. Danvers; Angelina J. Polsinelli; Suzanne A. Moseley; David A. Sbarra; Matthias R. Mehl; Shrikanth Narayanan

doi:10.1186/s13636-020-00194-0

EURASIP Journal on Audio, Speech, and Music Processing (Feb 2021)

Affiliations

Rajat Hebbar: Signal Analysis and Interpretation Laboratory
Pavlos Papadopoulos: Signal Analysis and Interpretation Laboratory
Ramon Reyes: Department of Psychology
Alexander F. Danvers: Department of Psychology
Angelina J. Polsinelli: Department of Neurology
Suzanne A. Moseley: Department of Psychology
David A. Sbarra: Department of Psychology
Matthias R. Mehl: Department of Psychology
Shrikanth Narayanan: Signal Analysis and Interpretation Laboratory

Journal volume & issue: Vol. 2021, no. 1
pp. 1 – 8

Abstract

In recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to access to large open-source datasets and improved computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers and TV or radio audio, make foreground speech detection challenging. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models using annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events, as observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
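To make the bag-level versus instance-level distinction concrete, the following is a minimal sketch of MIL pooling, not the authors' implementation. It assumes a model has already produced one foreground-speech probability per short frame (an instance) of a coarsely labeled clip (a bag). The abstract does not name the pooling methods studied; max, mean, and linear-softmax pooling are used here only as common examples, and all function names and values are hypothetical.

```python
def max_pool(probs):
    """Bag score is the single most confident instance."""
    return max(probs)


def mean_pool(probs):
    """Bag score is the average, so every instance counts equally."""
    return sum(probs) / len(probs)


def linear_softmax_pool(probs):
    """Weight each instance by its own probability: sum(p^2) / sum(p)."""
    total = sum(probs)
    if total == 0.0:
        return 0.0  # no instance shows any evidence
    return sum(p * p for p in probs) / total


def localize(probs, threshold=0.5):
    """Instance-level decision: indices of frames at or above the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Hypothetical per-frame probabilities for one coarsely labeled clip (assumed values).
bag = [0.1, 0.9, 0.8, 0.2]

bag_scores = {
    "max": max_pool(bag),
    "mean": mean_pool(bag),
    "linear_softmax": linear_softmax_pool(bag),
}
foreground_frames = localize(bag)
```

In this sketch only the bag score would be compared against the coarse label during training, while `localize` reads off frame-level decisions afterwards. Max pooling reacts to a single confident frame, whereas mean pooling is sensitive to how many frames are active, which is one reason the choice of pooling matters when events are densely distributed within a bag.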

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
