A Novel Multi-Feature Fusion Model Based on Pre-Trained Wav2vec 2.0 for Underwater Acoustic Target Recognition

Zijun Pu; Qunfei Zhang; Yangtao Xue; Peican Zhu; Xiaodong Cui

doi:10.3390/rs16132442

Remote Sensing (Jul 2024)

Affiliations

Zijun Pu: School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
Qunfei Zhang: School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
Yangtao Xue: School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
Peican Zhu: School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
Xiaodong Cui: School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China

DOI: https://doi.org/10.3390/rs16132442
Journal volume & issue: Vol. 16, no. 13, p. 2442

Abstract

Although data-driven Underwater Acoustic Target Recognition (UATR) methods have come to dominate marine acoustics, they are hampered by complex ocean environments and rather small datasets. To tackle these challenges, researchers have turned to transfer learning for UATR tasks. However, existing pre-trained models are trained on speech audio and are not well suited to underwater acoustic data, so they must be further adapted before they can be applied to UATR. Here, we propose a novel UATR framework called Attention Layer Supplement Integration (ALSI), which integrates large pre-trained neural networks with customized attention modules for acoustic data. Specifically, the ALSI model consists of two key modules: Scale ResNet and Residual Hybrid Attention Fusion (RHAF). First, the Scale ResNet module takes the Constant-Q transform feature as input and extracts the relatively important frequency information. Next, RHAF takes the temporal feature extracted by wav2vec 2.0 and the frequency feature extracted by Scale ResNet as input, and uses the attention mechanism to better integrate the time–frequency features with the temporal feature. In this way, the RHAF module helps wav2vec 2.0, which is trained on speech data, adapt to underwater acoustic data. Experiments on the ShipsEar dataset show that our model achieves a recognition accuracy of 96.39%, confirming its effectiveness on the UATR task.
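The fusion step described in the abstract can be illustrated with a minimal sketch of residual cross-attention, in which each temporal frame attends to a set of frequency vectors and the attended result is added back to the frame. This is not the authors' RHAF module: the function name `residual_cross_attention`, the toy inputs, and the absence of learned projections are all assumptions made for illustration, and the abstract does not specify RHAF's internal design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_cross_attention(temporal, spectral):
    """Fuse two feature sequences with residual cross-attention.

    temporal: list of T vectors (hypothetical stand-in for wav2vec 2.0 features).
    spectral: list of F vectors (hypothetical stand-in for CQT-derived features).
    Both must share the same vector length d. No learned weights are used.
    """
    d = len(temporal[0])
    scale = math.sqrt(d)
    fused = []
    for query in temporal:
        # Scaled dot-product similarity of this frame to every spectral vector.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in spectral]
        weights = softmax(scores)
        # Attention output: weighted average of the spectral vectors.
        context = [sum(w * vec[i] for w, vec in zip(weights, spectral))
                   for i in range(d)]
        # Residual connection: keep the original temporal feature.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy inputs (assumed values, not data from the paper).
temporal = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spectral = [[2.0, 0.0], [0.0, 2.0]]
fused = residual_cross_attention(temporal, spectral)
```

The residual addition means the temporal feature passes through unchanged when the attended context is uninformative, which is one plausible reason to pair attention with a residual path when adapting a speech-trained encoder to a new domain.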

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
