Transactions of the International Society for Music Information Retrieval (Jun 2020)

Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription

  • Adrien Ycart,
  • Lele Liu,
  • Emmanouil Benetos,
  • Marcus T. Pearce

DOI
https://doi.org/10.5334/tismir.57
Journal volume & issue
Vol. 3, no. 1

Abstract

Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting the number of errors, with equal weighting. Yet, some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons of transcriptions of the same music by pairs of different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, all the more so with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners’ quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make the collected data entirely available for further study, in particular to evaluate the perceptual relevance of new AMT metrics.
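
To make the benchmark metric mentioned above concrete, the following is a minimal sketch of an onset-only notewise F-measure, not the authors' implementation. It assumes notes are given as (onset in seconds, MIDI pitch) pairs, that an estimated note counts as correct when its pitch equals that of a reference note and its onset lies within a tolerance of that note's onset, and that each reference note can be matched at most once. The function name `onset_f_measure`, the input format, and the 50 ms default tolerance are illustrative assumptions; matching is done greedily per pitch rather than with a general bipartite matcher.

```python
from collections import defaultdict


def onset_f_measure(reference, estimate, tolerance=0.05):
    """Onset-only notewise F-measure (illustrative sketch).

    reference, estimate: lists of (onset_seconds, midi_pitch) tuples.
    tolerance: maximum onset deviation in seconds (assumed default: 50 ms).
    """
    # Group onsets by pitch so that only same-pitch notes can be matched.
    ref_by_pitch = defaultdict(list)
    est_by_pitch = defaultdict(list)
    for onset, pitch in reference:
        ref_by_pitch[pitch].append(onset)
    for onset, pitch in estimate:
        est_by_pitch[pitch].append(onset)

    matched = 0
    for pitch, est_onsets in est_by_pitch.items():
        ref_onsets = sorted(ref_by_pitch.get(pitch, []))
        est_onsets = sorted(est_onsets)
        i = j = 0
        # Two-pointer sweep over sorted onsets: each note is used at most once.
        while i < len(ref_onsets) and j < len(est_onsets):
            diff = est_onsets[j] - ref_onsets[i]
            if abs(diff) <= tolerance:
                matched += 1
                i += 1
                j += 1
            elif diff < 0:
                j += 1  # estimated note is too early to match anything later
            else:
                i += 1  # reference note is too early to match anything later

    precision = matched / len(estimate) if estimate else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one estimated onset is 80 ms late, one note is spurious.
reference = [(0.0, 60), (0.5, 64), (1.0, 67)]
estimate = [(0.02, 60), (0.58, 64), (1.0, 67), (1.5, 61)]
strict = onset_f_measure(reference, estimate, tolerance=0.05)
lenient = onset_f_measure(reference, estimate, tolerance=0.1)
```

On this toy example the note that arrives 80 ms late is rejected at the 50 ms tolerance but accepted at 100 ms, so the score rises as the tolerance is relaxed. This only illustrates how the threshold changes what counts as an error; it says nothing about which threshold agrees best with listeners.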

Keywords