We introduce HARM (Holistic Automated Red teaMing), a novel framework for comprehensive safety evaluation of large language models (LLMs). HARM scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our framework leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. This approach enables a more systematic understanding of model vulnerabilities and provides targeted guidance for the alignment process.
The overall workflow of our framework is illustrated in Figure 2; its key components are top-down test case generation, safety reward modeling, and multi-turn red teaming training.
The aim of top-down question generation is to systematically create test cases that simulate a broad spectrum of user intentions, thereby defining the initial scope of testing. The test cases generated in this phase serve as the opening questions for red teaming and are identical across different target LLMs.
The multi-turn red teaming module uses the safety reward model's scores on the target LLM's responses as reward signals, which allows the red-team agent to be tailored to each specific target LLM. With the opening questions as a contextual constraint, the dialogues generated by the red-team agent are less prone to mode collapse than test questions generated from scratch with reinforcement learning.
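To make this reward loop concrete, below is a minimal sketch of one red-teaming episode. It assumes the red-team agent, the target LLM, and the safety reward model are exposed as simple callables; these names and signatures are placeholders for illustration, not the actual interfaces in this repository.

```python
# Minimal sketch of one multi-turn red-teaming episode (illustrative only;
# `red_team_agent`, `target_llm`, and `safety_rm` are placeholder callables,
# not the actual interfaces used in this repository).

def run_episode(opening_question, red_team_agent, target_llm, safety_rm, max_turns=4):
    """Roll out a dialogue from a top-down opening question and collect a
    reward for the red-team agent at every turn. `safety_rm` takes the
    dialogue history (including the latest reply) and returns a scalar
    safety score, where higher means safer."""
    history = [("user", opening_question)]
    rewards = []
    for _ in range(max_turns):
        reply = target_llm(history)            # target model answers the probe
        history.append(("assistant", reply))
        # An unsafe reply means the probe succeeded, so the red-team agent's
        # reward is the negated safety score.
        rewards.append(-safety_rm(history))
        # The red-team agent, playing the user role, writes the next probe
        # conditioned on the opening question and the dialogue so far.
        history.append(("user", red_team_agent(history)))
    return history, rewards
```

Episodes that earn high rewards (i.e., dialogues that elicited unsafe responses) can then be retained as training data, which is the intuition behind the rejection sampling fine-tuning script listed below.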
- async_top_down_generation.py: Code for generating test questions using the direct method (see the sketch after this list).
- async_top_down_generation_attack_vectors.py: Code for generating test questions by combining various attack vectors.
- risk_categories folder: Contains the taxonomies of the 8 meta risk categories, seed questions, prompts, etc.
  - crime_attempts.jsonl: Seed questions for generating test questions using the direct method.
  - prompt.txt: Prompt for generating test questions using the direct method.
  - generated_questions.json: Test questions generated using the direct method.
  - attack_vectors folder:
    - xx.txt: Prompt for the xx attack vector.
    - xx.jsonl: Seed questions for the xx attack vector.
    - XX_questions.json: Test questions generated using the XX attack vector.
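For reference, here is a hedged sketch of what batched, asynchronous question generation of this kind can look like. It is not the actual logic of async_top_down_generation.py: the model name, the prompt placeholder, the seed field name, file paths, and the concurrency limit are all assumptions made for illustration.

```python
# Illustrative sketch only; paths, field names, prompt format, and model
# choice are assumptions, not the repository's actual implementation.
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def generate_one(template: str, seed: dict, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap the number of in-flight requests
        resp = await client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                # Assumes the prompt file contains a {seed_question} placeholder
                # and each JSONL record has a "question" field.
                "content": template.format(seed_question=seed["question"]),
            }],
            temperature=1.0,
        )
        return resp.choices[0].message.content


async def main() -> None:
    template = open("prompt.txt").read()
    seeds = [json.loads(line) for line in open("crime_attempts.jsonl")]
    sem = asyncio.Semaphore(10)
    questions = await asyncio.gather(*(generate_one(template, s, sem) for s in seeds))
    with open("generated_questions.json", "w") as f:
        json.dump(questions, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    asyncio.run(main())
```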
The corresponding implementation for each module:
- Masking strategy implementation: ./multi-turn/safe_rlhf/datasets/supervised_for_user.py, ./multi-turn/safe_rlhf/finetune/trainer.py (a sketch of the idea follows this list)
- SFT for red-team agent: ./multi-turn/scripts/red_teaming_sft_chat.sh
- Safety reward model training and inference: ./multi-turn/scripts/safety_reward_model.sh, ./multi-turn/multi-turn_reward_computation.py
- Rejection sampling fine-tuning for red-team agent: ./multi-turn/rejection_sampling.py, ./multi-turn/scripts/red_teaming_rs_chat.sh
- Multi-turn red teaming inference: ./multi-turn/multi-turn_interaction.py
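The masking strategy trains the model to play the user (red-team) side of the dialogue. Below is a minimal sketch of the underlying idea, assuming labels are masked with the usual -100 ignore index so that only the user-side turns contribute to the loss; the function and its argument format are illustrative, not the actual code in supervised_for_user.py.

```python
# Illustrative label masking for user-side supervised fine-tuning (a sketch of
# the idea only, not the code in supervised_for_user.py): tokens belonging to
# the red-team/"user" turns keep their labels, while all other tokens are set
# to IGNORE_INDEX so they do not contribute to the cross-entropy loss.
IGNORE_INDEX = -100


def build_inputs_and_labels(turn_token_ids, turn_roles):
    """turn_token_ids: one list of token ids per dialogue turn.
    turn_roles: the matching roles, e.g. ["user", "assistant", "user", ...]."""
    input_ids, labels = [], []
    for ids, role in zip(turn_token_ids, turn_roles):
        input_ids.extend(ids)
        if role == "user":
            labels.extend(ids)                         # learn the red-team turns
        else:
            labels.extend([IGNORE_INDEX] * len(ids))   # condition on, don't imitate
    return input_ids, labels
```

With labels built this way, a standard causal-LM fine-tuning loop learns to generate the next user turn while treating the target model's replies purely as context.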
- Anthropic red-team-attempts: Used for multi-turn SFT of the red-team agent, and as part of the seed questions in the test question generation phase.
- PKU-SafeRLHF: Part of the training set for the safety reward model.
- Anthropic harmless-base: Part of the training set for the safety reward model.
- HolisticBias: Its taxonomy serves as a prompt for our automatic taxonomy construction and forms part of our final taxonomy.
The code for asynchronous concurrent requests to the OpenAI API for batch generation of test questions is adapted from zeno-build. We also build on the PKU-SafeRLHF codebase for training the multi-turn red teaming components. We are grateful for these open-source resources, which greatly facilitated our work.
If you find our paper or code beneficial, please consider citing our work:
@inproceedings{zhang-etal-2024-holistic,
title = "Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction",
author = "Zhang, Jinchuan and
Zhou, Yan and
Liu, Yaxin and
Li, Ziming and
Hu, Songlin",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.760",
pages = "13711--13736",
abstract = "Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine interactions. To overcome these limitations, we propose **HARM** (**H**olistic **A**utomated **R**ed tea**M**ing), which scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance for the alignment process.",
}