EURASIP Journal on Audio, Speech, and Music Processing (Apr 2025)

MLAT: a multi-level attention transformer capturing multi-level information in compound word encoding for symbolic music generation

  • Lianyu Zhou,
  • Liang Yin,
  • Yukun Qian,
  • Mingjiang Wang

DOI
https://doi.org/10.1186/s13636-025-00407-4
Journal volume & issue
Vol. 2025, no. 1
pp. 1 – 16

Abstract


Compound word encoding is a widely used method in symbolic music generation tasks. It combines multiple different tokens into a super token, with each time step containing one super token, thereby transforming a long music sequence into multiple shorter subsequences. By applying word embeddings, concatenation, and projection, these subsequences are converted into inputs suitable for neural sequence models, known as compressed representations. Previous research on compound word encoding has used compressed representations to learn music features directly, but has not explicitly modeled the relationships between the individual tokens that compose the super tokens (independent representations). In this study, to model both compressed and independent representations, we propose a new multi-level attention transformer model (MLAT). MLAT consists of three main modules: compressed representation modeling (CRM), independent representation modeling (IRM), and feature interaction modeling (FIM). CRM learns the relationships between super tokens by modeling compressed representations, IRM learns the relationships between the individual tokens within super tokens by modeling independent representations, and FIM facilitates feature interaction between CRM and IRM. Furthermore, we introduce a new compressive compound word (CCP) encoding method, which significantly reduces the number of ignore tokens in super tokens (shortening the independent representations) and reduces the time steps needed to encode a musical piece (shortening the compressed representations). Experimental results show that MLAT effectively captures multi-level information in compound word encoding, thereby improving the quality of the generated music. CCP better models pitch diversity, rhythmic consistency, and chord structure, and outperforms mainstream compound word encodings in average inference time when tested on the baseline model.
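To make the "embeddings, concatenation, and projection" step concrete, the following is a minimal NumPy sketch of how a compressed representation can be built from super tokens. The field names, vocabulary sizes, and dimensions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Hypothetical sketch: each super token groups several token types
# (here: pitch, duration, velocity). Each type is embedded with its
# own table, the embeddings are concatenated, and a linear projection
# maps the result to the width expected by the sequence model.
rng = np.random.default_rng(0)

FIELDS = {"pitch": 128, "duration": 64, "velocity": 32}  # vocab sizes (assumed)
EMB_DIM = 16    # per-field embedding size (assumed)
MODEL_DIM = 32  # model input width (assumed)

# One embedding table per token type, randomly initialized for the demo
tables = {f: rng.normal(size=(v, EMB_DIM)) for f, v in FIELDS.items()}
# Projection from the concatenated embeddings to the model dimension
W_proj = rng.normal(size=(len(FIELDS) * EMB_DIM, MODEL_DIM))

def compress(super_tokens):
    """Map super tokens (dicts of field -> token id) to a matrix of
    compressed representations with shape (time_steps, MODEL_DIM)."""
    rows = []
    for tok in super_tokens:
        concat = np.concatenate([tables[f][tok[f]] for f in FIELDS])
        rows.append(concat @ W_proj)
    return np.stack(rows)

# Two time steps, each carrying one super token
seq = [{"pitch": 60, "duration": 4, "velocity": 10},
       {"pitch": 64, "duration": 8, "velocity": 10}]
X = compress(seq)
print(X.shape)  # one compressed vector per time step
```

In a trained model the tables and projection would be learned parameters (e.g. `nn.Embedding` and `nn.Linear` layers); the point here is only the shape of the pipeline, where a sequence of super tokens becomes one dense vector per time step.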

Keywords