Journal of King Saud University: Computer and Information Sciences (Jul 2024)
Scalability and diversity of StarGANv2-VC in Arabic emotional voice conversion: Overcoming data limitations and enhancing performance
Abstract
Emotional Voice Conversion (EVC) for under-resourced languages such as Arabic is challenging because emotional speech data are limited. This study explored strategies to mitigate dataset scarcity and improve Arabic EVC performance. Fundamental experiments (Speaker-Dependent, Gender-Dependent, Gender-Independent) were conducted using the KSUEmotions dataset to analyze speaker, gender, and model impacts. Data augmentation techniques such as time stretching and phase shuffling artificially increased data diversity. Attention mechanisms integrated into StarGANv2-VC aimed to better capture emotional cues. Transfer learning leveraged the larger English Emotional Speech Database (ESD) to enhance the Arabic system. A novel “Reordering Speaker-Emotion Data” approach treated each emotion as a separate speaker to expand emotional variability. Our comprehensive approach, combining transfer learning, data augmentation, and architectural modifications, demonstrates the potential to overcome dataset limitations and enhance the performance of Arabic EVC systems.
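
The “Reordering Speaker-Emotion Data” idea can be illustrated with a minimal sketch. The abstract does not specify how the relabeling was implemented, so everything below is an assumption made for illustration: the `Utterance` record, the function name `reorder_speaker_emotion`, the file names, and in particular the reading that each (speaker, emotion) pair becomes its own pseudo-speaker domain for a speaker-conditioned model such as StarGANv2-VC.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    """Hypothetical record for one audio file (not from the paper)."""
    path: str
    speaker: str
    emotion: str


def reorder_speaker_emotion(utterances):
    """Assign one pseudo-speaker index to every (speaker, emotion) pair.

    Assumption: the conversion model conditions on a single integer
    "speaker" label, so giving each emotion of each speaker its own label
    lets the model treat emotions as separate speakers.
    """
    # Sort the unique pairs so the index assignment is deterministic.
    pairs = sorted({(u.speaker, u.emotion) for u in utterances})
    domain_of = {pair: idx for idx, pair in enumerate(pairs)}
    labelled = [(u.path, domain_of[(u.speaker, u.emotion)]) for u in utterances]
    return labelled, domain_of


# Made-up file names and labels, for illustration only.
utterances = [
    Utterance("s01_neutral_001.wav", "s01", "neutral"),
    Utterance("s01_happy_001.wav", "s01", "happy"),
    Utterance("s02_neutral_001.wav", "s02", "neutral"),
    Utterance("s02_happy_001.wav", "s02", "happy"),
]

labelled, domain_of = reorder_speaker_emotion(utterances)
for path, domain in labelled:
    print(path, domain)
```

Under this reading, two speakers with two emotions each yield four pseudo-speaker domains instead of two, which is one way the relabeling could expand the variability seen by the model.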