IEEE Access (Jan 2024)

GalaxyGPT: A Hybrid Framework for Large Language Model Safety

  • Hange Zhou,
  • Jiabin Zheng,
  • Longtu Zhang

DOI: https://doi.org/10.1109/ACCESS.2024.3425662
Journal volume & issue: Vol. 12, pp. 94436–94451

Abstract


Balancing safety and utility in Large Language Models (LLMs) requires solutions that go beyond conventional pre- and post-processing, red-teaming, and feedback fine-tuning. The growing complexity of online interactions makes it imperative that LLMs operate within safe and ethical boundaries without compromising their utility. In response, we introduce GalaxyGPT, a framework that combines the safety moderation services of Internet vendors with LLMs to enhance safety performance. GalaxyGPT leverages advanced algorithms and a comprehensive dataset to significantly improve safety, achieving 95.8% accuracy and a 94.5% F1-score on our custom dataset of 500 single-round safety tests, 100 multi-round dialogue tests, and 200 open-source tests. These results markedly outperform the safety metrics of APIs from six vendors (40.5% average accuracy) and of LLMs without GalaxyGPT integration (73% accuracy). In addition, we contribute to the community by releasing an open-source test set of 600 entries and a compact classification model for security tasks, specifically designed to challenge and strengthen the robustness of APIs, thereby facilitating the efficient deployment and application of GalaxyGPT in diverse environments.
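The abstract does not specify GalaxyGPT's internal design, but the core idea it names (fusing verdicts from vendor moderation services with an LLM-side compact classifier) can be illustrated with a minimal sketch. Everything below is a hypothetical illustration: the function names, the any-vendor-flags policy, and the 0.5 threshold are assumptions, not details from the paper.

```python
# Hypothetical sketch of a hybrid safety gate: a text is flagged if any
# vendor moderation service flags it, or if a local compact classifier's
# unsafe-probability exceeds a threshold. Names and policy are illustrative.

def hybrid_is_unsafe(text, vendor_checks, local_classifier, threshold=0.5):
    """Return True if any vendor check flags `text`, or if the local
    classifier assigns it an unsafe-probability above `threshold`."""
    if any(check(text) for check in vendor_checks):
        return True
    return local_classifier(text) > threshold

# Toy stand-ins for vendor APIs and a compact local model:
vendors = [lambda t: "attack" in t.lower()]
classifier = lambda t: 0.9 if "exploit" in t.lower() else 0.1

print(hybrid_is_unsafe("How do I plan an attack?", vendors, classifier))       # True
print(hybrid_is_unsafe("What is the capital of France?", vendors, classifier)) # False
```

A real deployment would replace the lambdas with calls to the vendors' moderation endpoints and the released compact classification model; the point of the sketch is only the aggregation step, where vendor signals and the local model compensate for each other's blind spots.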

Keywords