IEEE Access (Jan 2021)

Audio-Visual Model for Generating Eating Sounds Using Food ASMR Videos

  • Kodai Uchiyama,
  • Kazuhiko Kawamoto

DOI
https://doi.org/10.1109/ACCESS.2021.3069267
Journal volume & issue
Vol. 9
pp. 50106–50111

Abstract


We present an audio-visual model for generating food texture sounds from silent eating videos. We designed a deep network-based model that takes the visual features of detected faces as input and outputs a magnitude spectrogram aligned with the visual stream. Because generating raw waveform samples directly from a given visual stream is challenging, in this study we used the Griffin-Lim algorithm to recover phase from the predicted magnitude spectrogram and then generated raw waveform samples via the inverse short-time Fourier transform. Additionally, we produced waveforms from these magnitude spectrograms using an example-based synthesis procedure. To train the model, we created a dataset of food autonomous sensory meridian response (ASMR) videos. We evaluated our model on this dataset and found that the predicted sound features exhibit appropriate temporal synchronization with the visual inputs. Our subjective evaluation experiments demonstrated that the predicted sounds are sufficiently realistic to fool participants in a “real” or “fake” psychophysical experiment.
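As a minimal sketch of the phase-recovery step the abstract describes, the snippet below recovers a waveform from a magnitude spectrogram with librosa's Griffin-Lim implementation followed by the inverse short-time Fourier transform (performed internally by `librosa.griffinlim`). The STFT parameters (`n_fft`, `hop_length`) and the iteration count are illustrative assumptions, not values taken from the paper, and the dummy spectrogram stands in for the network's prediction.

```python
# Hedged sketch: Griffin-Lim phase recovery from a predicted magnitude
# spectrogram. Parameter values are assumptions for illustration only.
import numpy as np
import librosa


def spectrogram_to_waveform(magnitude: np.ndarray,
                            n_fft: int = 1024,
                            hop_length: int = 256,
                            n_iter: int = 60) -> np.ndarray:
    """Recover a time-domain waveform from a magnitude spectrogram.

    Griffin-Lim iteratively estimates the missing phase; the inverse
    STFT then maps the complex spectrogram back to waveform samples.
    """
    return librosa.griffinlim(magnitude,
                              n_iter=n_iter,
                              n_fft=n_fft,
                              hop_length=hop_length)


# Example: a dummy magnitude spectrogram with n_fft // 2 + 1 = 513
# frequency bins and 200 time frames, standing in for a model output.
mag = np.abs(np.random.randn(513, 200)).astype(np.float32)
wav = spectrogram_to_waveform(mag)
```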

Keywords