Autonomous Air Combat Maneuver Decision-Making Based on PPO-BWDA

Hongming Wang; Zhuangfeng Zhou; Junzhe Jiang; Wenqin Deng; Xueyun Chen

doi:10.1109/ACCESS.2024.3419889

IEEE Access (Jan 2024)

Affiliations

Hongming Wang: School of Electrical Engineering, Guangxi University, Nanning, China
Zhuangfeng Zhou: School of Electrical Engineering, Guangxi University, Nanning, China
Junzhe Jiang: School of Electrical Engineering, Guangxi University, Nanning, China
Wenqin Deng: School of Electrical Engineering, Guangxi University, Nanning, China
Xueyun Chen: School of Electrical Engineering, Guangxi University, Nanning, China

DOI: https://doi.org/10.1109/ACCESS.2024.3419889
Journal volume & issue: Vol. 12
pp. 119116 – 119132

Abstract

As Unmanned Combat Aerial Vehicles (UCAVs) continue to play an increasingly pivotal role in modern aerial warfare, enhancing their level of intelligence is imperative for global military advancement. Despite notable progress in applying deep reinforcement learning to autonomous air combat maneuver decision-making, existing methods suffer from poor performance, slow training, and susceptibility to local optima. This paper therefore proposes a new air combat maneuver decision algorithm based on Proximal Policy Optimization (PPO). First, we establish a UCAV adversarial model and design a dual observation space. Second, we develop an Actor-Critic network based on Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Self-Attention (MHSA), which better handles the high-dimensional, temporally correlated information found in air combat situations. Third, we propose an action selection method based on Parallel Monte Carlo Tree Search with Watch the Unobserved (WU-PMCTS) to help the algorithm make more effective maneuver decisions. Fourth, we design a Dynamic Reward Evaluation (DRE) method that adjusts the weights of the individual rewards according to the adversarial situation, improving algorithm performance. Finally, we introduce Advantage Prioritized Experience Replay (APER), which samples experiences according to their advantage values, improving training efficiency. Ablation and comparative experiments demonstrate the superiority of the proposed algorithm over PPO and other mainstream algorithms, with a 0.32 increase in average return and a 36% increase in win rate.
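The idea of sampling experiences according to their advantage values can be illustrated with a minimal sketch. The abstract does not give the priority rule used by APER, so the code below assumes the form common in standard prioritized experience replay, where the priority is proportional to (|A| + eps) ** alpha. The function names, `alpha`, and `eps` are hypothetical and are not taken from the paper.

```python
import random


def advantage_priorities(advantages, alpha=0.6, eps=1e-6):
    """Convert advantage estimates into sampling probabilities.

    Assumption: priority = (|A| + eps) ** alpha, borrowed from standard
    prioritized experience replay. The paper's actual rule may differ.
    """
    scaled = [(abs(a) + eps) ** alpha for a in advantages]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_batch(transitions, advantages, batch_size, rng=None):
    """Draw a minibatch, with replacement, weighted by advantage priority."""
    rng = rng or random.Random()
    probs = advantage_priorities(advantages)
    return rng.choices(transitions, weights=probs, k=batch_size)


if __name__ == "__main__":
    # Toy buffer: three stored transitions and their advantage estimates.
    buffer = ["t0", "t1", "t2"]
    adv = [0.1, -2.0, 0.5]
    print(advantage_priorities(adv))
    print(sample_batch(buffer, adv, batch_size=4, rng=random.Random(0)))
```

In this sketch, transitions with a large absolute advantage are drawn more often than those with an advantage near zero, and `alpha` controls how strongly the sampling departs from uniform (`alpha=0` recovers uniform sampling).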

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
