IEEE Access (Jan 2024)

Deep Multilevel Cascade Residual Recurrent Framework (MCRR) for Sheet Music Recognition

  • Ping Yu,
  • Hailing Chen

DOI
https://doi.org/10.1109/ACCESS.2024.3350880
Journal volume & issue
Vol. 12
pp. 6941–6960

Abstract

Sheet music recognition is a vital technology aimed at converting printed or handwritten musical scores into digital, machine-readable formats. Its significance lies in making musical compositions easier to edit, perform, learn, and share, thereby fostering music education, composition, and culture; it also provides a powerful tool for music analysis, research, and preservation. Our aim is to investigate a sheet music recognition method that offers a simple workflow, high recognition accuracy, and fast model convergence. The proposed Deep Multilevel Cascade Residual Recurrent (MCRR) framework for sheet music recognition consists of the following components. First, we introduce additive Gaussian white noise, additive Perlin noise, and elastic deformations such as rotation and stretching to simulate real-world noise in sheet music images, thereby augmenting the dataset, enhancing model robustness, and mitigating overfitting. Second, in the feature extraction phase, we employ a residual Convolutional Neural Network (ConvNet) to address the issue of model degradation and use a multilevel cascade fusion technique to obtain comprehensive feature information, improving the model's feature extraction capability and reducing recognition errors. For note recognition, we use a variant of the Recurrent Neural Network (RNN) called the Simple Recurrent Unit (SRU), which converts most of the computation into parallel operations and thereby speeds up model convergence. Finally, we combine the Connectionist Temporal Classification (CTC) loss function with the SRU to eliminate the requirement for strict alignment between data and labels, enabling note classification and recognition. Extensive ablation experiments and comparative analyses, including visual analysis, intuitive illustrations, and quantitative assessments, confirm the effectiveness of the proposed method and demonstrate its superiority over various state-of-the-art methods. The proposed method achieved promising results on both the PrIMuS and Camera-PrIMuS datasets. Specifically, on the PrIMuS dataset it obtained a Symbol Error Rate (SeER) of 1.4571% and a System Error Rate (SyER) of 0.3234%. Notably, it demonstrated high accuracy in pitch, type, and note recognition, scoring approximately 97% in pitch and type accuracy and around 94% in note accuracy, with a relatively low training time of 0.56 seconds per epoch. On the Camera-PrIMuS dataset, the method achieved slightly lower but still competitive results: an SeER of 5.1488% and a SyER of 1.0612%, with pitch and type accuracies around 90%, note accuracy at approximately 88%, and a slightly higher training time of 1.93 seconds per epoch. Furthermore, we compare our method with existing commercial software, namely Capella-scan, PhotoScore, and SmartScore. Among these, Capella-scan delivers the best performance but exhibits lower robustness than the proposed method.
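The abstract outlines a convolutional-recurrent pipeline trained with CTC, so the predicted symbol sequence need not be aligned frame-by-frame with the image. The sketch below illustrates that general idea only: a small residual convolutional feature extractor, a simplified single-layer SRU-style recurrence (following Lei et al., 2017), and PyTorch's built-in CTC loss. All module names, layer sizes, symbol counts, and tensor shapes are illustrative assumptions, and the paper's data-augmentation and multilevel cascade fusion stages are not reproduced; this is not the authors' exact architecture.

```python
# Minimal sketch of a residual-CNN + SRU + CTC recognizer (assumed shapes/sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """3x3 convolutional residual block (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut mitigates degradation


class SimpleSRU(nn.Module):
    """Simplified single-layer Simple Recurrent Unit.

    Input-side projections are computed for all time steps at once, so only a
    lightweight elementwise recurrence remains sequential.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.proj = nn.Linear(input_size, 3 * hidden_size)  # candidate, f, r gates

    def forward(self, x):                        # x: (T, B, input_size)
        xt, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = torch.zeros_like(xt[0])
        outputs = []
        for t in range(x.size(0)):               # cheap elementwise recurrence
            c = f[t] * c + (1.0 - f[t]) * xt[t]
            outputs.append(r[t] * torch.tanh(c) + (1.0 - r[t]) * xt[t])
        return torch.stack(outputs)              # (T, B, hidden_size)


class SheetMusicRecognizer(nn.Module):
    def __init__(self, num_symbols, hidden=128):
        super().__init__()
        self.stem = nn.Conv2d(1, 64, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(64), nn.MaxPool2d(2),
                                    ResidualBlock(64), nn.MaxPool2d(2))
        self.rnn = SimpleSRU(64, hidden)
        self.classifier = nn.Linear(hidden, num_symbols + 1)  # +1 for CTC blank

    def forward(self, images):                   # images: (B, 1, H, W)
        feats = self.blocks(F.relu(self.stem(images)))
        feats = feats.mean(dim=2)                # collapse height: (B, C, W')
        feats = feats.permute(2, 0, 1)           # (T=W', B, C) for the recurrence
        return self.classifier(self.rnn(feats))  # (T, B, num_symbols + 1)


# CTC removes the need for frame-level alignment between image columns and
# symbol labels: only the per-sample target lengths are required.
model = SheetMusicRecognizer(num_symbols=100)
images = torch.randn(2, 1, 64, 256)                    # dummy score images
targets = torch.randint(1, 101, (2, 20))               # dummy symbol sequences
log_probs = F.log_softmax(model(images), dim=-1)
input_lengths = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```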

Keywords