IEEE Access (Jan 2024)
Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models
Abstract
Large Language Models (LLMs) excel in interactive chat scenarios due to their advanced conversational abilities. However, their training process invariably exposes them to a diverse range of harmful or toxic content, posing significant challenges in ensuring that LLM responses align with human ethical values. Consequently, the detection and quantification of adverse content remain a paramount issue in contemporary research. In this paper, we introduce the SAFE dataset, a novel resource designed to advance safety assessment research on LLMs. Our dataset extends beyond the binary categorization of content into “safe” and “unsafe”. Drawing upon human interpretations of safety, we further delineate unsafe content into six granular categories: Sensitivity, Harmfulness, Falsehood, Information Corruption, Unnaturalness, and Deviation from Instructions. This refined classification is intended to help LLMs discern unsafe content more accurately. In total, we have created a dataset comprising 52,340 instruction-response pairs, each annotated with safety meta-tags, and we have additionally compiled expert comparative assessments for these categories. Using the SAFE dataset, we trained a multi-expert rating model that evaluates LLM responses across multiple dimensions, demonstrating the dataset’s potential for safety assessment of LLMs. The model’s ability to provide multi-faceted evaluations reflects the nuanced requirements of LLM response assessment. We believe this dataset is a valuable resource for the community, contributing to the safe development and deployment of LLMs, and that our findings and resources will support future research in this domain.
Keywords