Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Oct 2024)
Low-complexity multi task learning for joint acoustic scenes classification and sound events detection
Abstract
The task of automatic metainformation recognition from audio is to detect and extract data of various kinds (speech, noise, acoustic scenes, acoustic events, anomalies) from a given input audio signal. This area is well developed and well known to the scientific community, and a variety of high-quality approaches exist. However, the vast majority of such methods are based on large neural networks with a huge number of trainable weights; consequently, it is impractical to use them in environments with severely limited computing resources. The smart-device industry is currently growing rapidly: smartphones, smartwatches, voice assistants, TVs, smart-home devices. Such products are constrained in both processing power and memory. At present, the state-of-the-art way to cope with these constraints is to use so-called low-complexity models, and in recent years the interest of the scientific community in this problem has been growing (DCASE Workshop). Two of the most crucial subtasks in the overall metainformation recognition problem are Acoustic Scene Classification and Sound Event Detection. The key scientific questions are the development of both an optimal low-complexity neural network architecture and learning algorithms that yield a low-resource, high-quality system for classifying acoustic scenes and detecting sound events. In this paper, the datasets from the DCASE Challenge tasks “Low-Complexity Acoustic Scene Classification” and “Sound Event Detection with Weak Labels and Synthetic Soundscapes” were used. A multitask neural network architecture was proposed, consisting of a common encoder and two independent decoders, one for each task.
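The shared-encoder, two-decoder layout described above can be sketched as follows. This is a minimal illustration only: the layer shapes, input features, and class counts are assumptions for the sketch, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class MTLNet(nn.Module):
    """Hypothetical sketch of the multitask model: one shared encoder
    and two independent task-specific decoders (ASC and SED heads)."""

    def __init__(self, n_scenes=10, n_events=10):
        super().__init__()
        # Shared encoder: illustrative small CNN over a log-mel spectrogram
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
        )
        feat = 16 * 8 * 8
        # Decoder 1: clip-level acoustic scene classification head
        self.scene_decoder = nn.Linear(feat, n_scenes)
        # Decoder 2: sound event detection head (clip-level here for
        # brevity; a real SED head would output frame-level activations)
        self.event_decoder = nn.Linear(feat, n_events)

    def forward(self, x):
        z = self.encoder(x)  # shared representation used by both tasks
        return self.scene_decoder(z), self.event_decoder(z)
```

Because both decoders consume the same encoder output, the encoder's parameters and MACs are paid for once rather than twice, which is the source of the savings over two independent models.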
The classical multitask learning algorithms SoftMTL and HardMTL were considered, and two modifications were developed: CrossMTL, based on the idea of reusing data from one task when training the decoder for the other task, and FreezeMTL, in which the trained weights of the common encoder are frozen after training on the first task and reused to optimize the second decoder. Experiments showed that the CrossMTL modification significantly increases the accuracy of both acoustic scene classification and event detection compared with the classical SoftMTL and HardMTL approaches. The FreezeMTL algorithm yielded a model that achieves 42.44 % accuracy in scene classification and 45.86 % in event detection, which is comparable to the results of the 2023 baseline solutions. The proposed low-complexity neural network consists of 633.5 K trainable parameters and requires 43.2 M MACs to process one second of audio, i.e., 7.8 % fewer trainable parameters and 40 % fewer MACs than the naive application of two independent models. Thanks to its small number of trainable parameters and the small number of MACs required at inference, the developed model can be deployed on smart devices.
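The FreezeMTL step, freezing the shared encoder after the first task so that only the second decoder is optimized, can be sketched in PyTorch like this. The tiny stand-in model and its submodule names (`encoder`, `event_decoder`) are hypothetical placeholders for a model with the shared-encoder structure described above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model: shared encoder + second-task decoder.
model = nn.ModuleDict({
    "encoder": nn.Linear(8, 4),
    "event_decoder": nn.Linear(4, 3),
})

# FreezeMTL sketch: after training on the first task, freeze the
# shared encoder's weights so it no longer receives gradient updates.
for p in model["encoder"].parameters():
    p.requires_grad = False

# Only the parameters that still require gradients are optimized,
# i.e. the second decoder.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-3)

# One illustrative optimization step for the second task.
x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
logits = model["event_decoder"](model["encoder"](x))
loss = nn.functional.cross_entropy(logits, y)
opt.zero_grad()
loss.backward()
opt.step()
```

Since the frozen encoder was already fitted on the first task, this reuses its learned representation at zero additional training cost for the encoder itself.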
Keywords