Applied Sciences (Jul 2024)
DExter: Learning and Controlling Performance Expression with Diffusion Models
Abstract
In the pursuit of expressive music performance models built with artificial intelligence, this paper introduces DExter, a new approach that leverages diffusion probabilistic models to render Western classical piano performances. The main challenge in performance rendering is the continuous, sequential modeling of expressive timing and dynamics, which is critical for capturing the evolving nuances that characterize live musical performances. In our approach, performance parameters are represented in a continuous expression space, and a diffusion model is trained to predict these continuous parameters conditioned on a musical score. DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features, by conditioning jointly on score and perceptual-feature representations. We find that the model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on performance metrics along dimensions such as asynchrony and articulation, as well as through listening tests that compare generated performances with different human interpretations. The results show that DExter captures the time-varying correlations among the expressive parameters and compares well with existing rendering models in subjective ratings. The perceptual-feature-conditioned generation and style-transfer capabilities of DExter are verified via a proxy model that predicts the perceptual characteristics of differently steered performances.
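To make the conditioning scheme described above concrete, the following is a minimal, illustrative PyTorch sketch of one score-conditioned diffusion training step, in which a denoiser learns to predict the noise added to continuous expressive parameters. All module names, tensor shapes, and the DDPM-style noise schedule here are assumptions for illustration only; they do not reflect DExter's actual architecture or parameterization, which is described in the paper body.

```python
import torch
import torch.nn as nn

class ScoreConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a sequence of expressive
    parameters, conditioned on aligned score features (shapes illustrative)."""
    def __init__(self, param_dim=4, score_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim + score_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, noisy_params, score_feats, t):
        # t: (batch, 1, 1) normalized timestep, broadcast over the note axis
        t_channel = t.expand(*noisy_params.shape[:-1], 1)
        return self.net(torch.cat([noisy_params, score_feats, t_channel], dim=-1))

# DDPM-style noise schedule (assumed, for illustration).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

model = ScoreConditionedDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch: (batch, notes, dims) expressive parameters and score features.
x0 = torch.randn(8, 64, 4)
cond = torch.randn(8, 64, 16)

# One training step: noise the clean parameters at a random timestep,
# then train the denoiser to recover that noise given the score condition.
t = torch.randint(0, T, (8, 1, 1))
noise = torch.randn_like(x0)
a_bar = alphas_bar[t]
xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

opt.zero_grad()
pred = model(xt, cond, t.float() / T)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
opt.step()
```

At sampling time, the same denoiser would be applied iteratively from pure noise, with the score (and, in the jointly conditioned variant, perceptual features) held fixed as conditioning throughout.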
Keywords