Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models

Jia Yu; Long Li; Zhenzhong Lan

doi:10.1109/ACCESS.2024.3393245

IEEE Access (Jan 2024)

Affiliations

Jia Yu: School of Engineering, Zhejiang University, Hangzhou, China
Long Li: School of Engineering, Zhejiang University, Hangzhou, China
Zhenzhong Lan: School of Engineering, Zhejiang University, Hangzhou, China

DOI: https://doi.org/10.1109/ACCESS.2024.3393245
Journal volume & issue: Vol. 12
pp. 64717 – 64726

Abstract

Large Language Models (LLMs) excel in interactive chat scenarios due to their advanced conversational abilities. However, their training process invariably exposes them to a diverse range of harmful or toxic content, posing significant challenges in ensuring that LLM responses align with human ethical values. Consequently, the detection and quantification of adverse content remain a paramount issue in contemporary research. In this paper, we introduce the SAFE dataset, a novel resource designed to advance safety assessment research in LLMs. Our dataset extends beyond the binary categorization of content into “safe” and “unsafe”. Drawing upon human interpretations of safety, we further divide unsafe content into six granular categories: Sensitivity, Harmfulness, Falsehood, Information Corruption, Unnaturalness, and Deviation from Instructions. This refined classification aims to enhance LLMs’ ability to discern unsafe data more accurately. In total, we have created a dataset comprising 52,340 instruction-response pairs, each annotated with safety meta-tags. Additionally, we have compiled expert comparative assessments for these indicators. We developed a multi-expert rating model, trained on the SAFE dataset, that evaluates the responses of LLMs across various dimensions. This approach highlights the potential of our dataset for safety assessment of LLMs. The model’s capability to provide multi-faceted evaluations reflects an advanced understanding of the nuanced requirements in LLM response assessment. We believe this dataset represents a valuable resource for the community, contributing to the safe development and deployment of LLMs. Our findings and resources are poised to fuel future research in this domain.
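To make the annotation scheme in the abstract concrete, the following is a minimal Python sketch of how one instruction-response pair with safety meta-tags might be represented. Only the six category names come from the abstract. The class names, field names, the `category_counts` helper, and the choice to let one pair carry several tags at once are all hypothetical assumptions for illustration, not the released format of the SAFE dataset.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class UnsafeCategory(Enum):
    """The six unsafe categories named in the abstract."""
    SENSITIVITY = "Sensitivity"
    HARMFULNESS = "Harmfulness"
    FALSEHOOD = "Falsehood"
    INFORMATION_CORRUPTION = "Information Corruption"
    UNNATURALNESS = "Unnaturalness"
    DEVIATION_FROM_INSTRUCTIONS = "Deviation from Instructions"


@dataclass(frozen=True)
class SafetyRecord:
    """Hypothetical record layout; field names are assumptions."""
    instruction: str
    response: str
    # Assumption: a pair may carry zero or more category tags.
    tags: frozenset = field(default_factory=frozenset)

    @property
    def is_safe(self) -> bool:
        # In this sketch, a pair with no unsafe tag counts as safe.
        return not self.tags


def category_counts(records):
    """Count how many records carry each unsafe category tag."""
    counts = Counter()
    for record in records:
        counts.update(record.tags)
    return counts


# Placeholder records, invented purely to exercise the types.
examples = [
    SafetyRecord("placeholder instruction A", "placeholder response A"),
    SafetyRecord(
        "placeholder instruction B",
        "placeholder response B",
        frozenset({UnsafeCategory.FALSEHOOD, UnsafeCategory.UNNATURALNESS}),
    ),
]
```

Under these assumptions, the binary safe/unsafe view is recovered from `is_safe`, while `category_counts(examples)` gives the finer per-category breakdown that the abstract argues for.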

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
