IEEE Access (Jan 2021)
Pre-Trained-Based Individualization Model for Real-Time Spatial Audio Rendering System
Abstract
Spatial audio has attracted increasing attention in fields such as virtual reality (VR) and blind navigation. Individualized head-related transfer functions (HRTFs) play an important role in generating spatial audio with accurate localization perception. Existing methods focus on a single database and do not fully exploit the information available across multiple databases. In light of this, this paper proposes a pre-trained-based individualization model to predict HRTFs for any target user, and implements a real-time spatial audio rendering system built on a wearable device to produce an immersive virtual auditory display. The proposed method first builds a pre-trained model from multiple databases using a DNN-based model combined with an autoencoder-based dimensionality reduction method. This model captures the nonlinear relationship between user-independent HRTFs and position-dependent features. Then, fine-tuning is performed on a limited number of layers of the pre-trained model using a transfer learning technique. The key idea behind fine-tuning is to adapt the pre-trained user-independent model to a user-dependent one based on anthropometric features. Finally, real-time issues are discussed to guarantee a fluent auditory experience during dynamic scene updates, including fine-grained head-related impulse response (HRIR) acquisition, efficient spatial audio reproduction, and parallel synthesis and playback. These techniques allow the system to run with little computational cost, thus minimizing processing delay. The experimental results show that the proposed model outperforms other methods in terms of both subjective and objective metrics. Additionally, our rendering system runs on an HTC Vive with almost unnoticeable delay.
Keywords