CodeAttack is one of the strongest jailbreak methods for LLMs so far.
- An enhanced version of CodeAttack, highly effective against the latest GPT-4 and Claude-3 series models, will be available next week! For more details, please refer to our ACL 2024 paper.
✨ An example run:

```bash
python3 main.py --num-sample 1 \
    --prompt-type python_stack_plus \
    --query-file ./data/harmful_behaviors.csv \
    --target-model=gpt-4o \
    --judge-model=gpt-4o \
    --exp-name=main \
    --target-max-n-tokens=1000 \
    --multi-thread \
    --temperature 0 \
    --start-idx 0 --end-idx -1
```
- The `data` folder contains the `harmful_behaviors.csv` dataset from AdvBench. You can add your own dataset here.
- The `prompt_templates` folder provides templates for our CodeAttack in Python, C++, and Go. We recommend using a prompt template with "plus" to get a more detailed model response.
- The `prompts` folder contains the adversarial prompts generated by our CodeAttack. For convenience, we include three versions of CodeAttack curated using AdvBench: `data_python_string_full.json`, `data_python_list_full.json`, and `data_python_stack_full.json`.
Overview of our CodeAttack. CodeAttack constructs a code template with three steps: (1) Input encoding which encodes the harmful text-based query with common data structures; (2) Task understanding which applies a decode() function to allow LLMs to extract the target task from various kinds of inputs; (3) Output specification which enables LLMs to fill the output structure with the user’s desired content.
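
To make the three steps concrete, the sketch below illustrates the general idea with a benign query and hypothetical helper names (`encode_query_as_stack`, `decode`, `output_structure` are not from the repo); the actual templates in `prompt_templates` differ in wording and structure.

```python
# Illustrative sketch of CodeAttack's three steps (not the exact prompt_templates content).

# (1) Input encoding: split the natural-language query into words and push them
#     onto a stack-like list, so the query is no longer expressed as plain text.
def encode_query_as_stack(query: str) -> list[str]:
    stack = []
    for word in query.split():
        stack.append(word)  # push each word onto the stack
    return stack

# (2) Task understanding: the code template asks the model to reason through a
#     decode() function that reconstructs the task from the data structure.
def decode(stack: list[str]) -> str:
    words = []
    while stack:
        words.append(stack.pop())  # pop words in reverse order
    return " ".join(reversed(words))

# (3) Output specification: the template provides an output structure (e.g., a
#     dict with a list of steps) that the model is asked to fill in.
def output_structure() -> dict:
    return {"task": "", "steps": []}

if __name__ == "__main__":
    encoded = encode_query_as_stack("write a friendly greeting message")
    print(decode(encoded))  # reconstructs the original query text
```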
If you find our paper and tool interesting and useful, please feel free to give us a star and cite us:
```bibtex
@inproceedings{Ren2024codeattack,
  title={Exploring Safety Generalization Challenges of Large Language Models via Code},
  author={Qibing Ren and Chang Gao and Jing Shao and Junchi Yan and Xin Tan and Wai Lam and Lizhuang Ma},
  booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://arxiv.org/abs/2403.07865}
}
```