OpenRedTeaming

Our Survey: Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models [Paper]

To gain a comprehensive understanding of potential attacks on GenAI and develop robust safeguards, we:

Survey over 120 papers, covering the pipeline from risk taxonomy, attack strategies, evaluation metrics, and benchmarks to defensive approaches.
Propose a comprehensive taxonomy of LLM attack strategies grounded in the inherent capabilities of models developed during pretraining and fine-tuning.
Implement more than 30 automated red teaming methods.

To stay updated or try our RedTeaming tool, please subscribe to our newsletter at our website or join us on Discord!

Latest Papers about Red Teaming

Surveys, Taxonomies and more

Surveys

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security [Paper]

TrustLLM: Trustworthiness in Large Language Models [Paper]

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems [Paper]

Security and Privacy Challenges of Large Language Models: A Survey [Paper]

Surveys on Attacks

Robust Testing of AI Language Model Resiliency with Novel Adversarial Prompts [Paper]

Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [Paper]

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models [Paper]

LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study [Paper]

An Early Categorization of Prompt Injection Attacks on Large Language Models [Paper]

Comprehensive Assessment of Jailbreak Attacks Against LLMs [Paper]

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [Paper]

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [Paper]

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition [Paper]

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats [Paper]

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [Paper]

Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild [Paper]

A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models [Paper]

Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems [Paper]

Surveys on Risks

Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal [Paper]

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices [Paper]

Privacy in Large Language Models: Attacks, Defenses and Future Directions [Paper]

Beyond the Safeguards: Exploring the Security Risks of ChatGPT [Paper]

Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements [Paper]

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities [Paper]

From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy [Paper]

Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications [Paper]

The power of generative AI in cybersecurity: Opportunities and challenges [Paper]

Taxonomies

Coercing LLMs to do and reveal (almost) anything [Paper]

The History and Risks of Reinforcement Learning and Human Feedback [Paper]

From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude [Paper]

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study [Paper]

Generating Phishing Attacks using ChatGPT [Paper]

Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback [Paper]

AI Deception: A Survey of Examples, Risks, and Potential Solutions [Paper]

A Security Risk Taxonomy for Large Language Models [Paper]

Positions

Red-Teaming for Generative AI: Silver Bullet or Security Theater? [Paper]

The Ethics of Interaction: Mitigating Security Threats in LLMs [Paper]

A Safe Harbor for AI Evaluation and Red Teaming [Paper]

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity [Paper]

The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward [Paper]

Phenomenons

Red-Teaming Segment Anything Model [Paper]

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity [Paper]

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue [Paper]

Tradeoffs Between Alignment and Helpfulness in Language Models [Paper]

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [Paper]

"It's a Fair Game", or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents [Paper]

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks [Paper]

Can Large Language Models Change User Preference Adversarially? [Paper]

Are aligned neural networks adversarially aligned? [Paper]

Fake Alignment: Are LLMs Really Aligned Well? [Paper]

Causality Analysis for Evaluating the Security of Large Language Models [Paper]

Transfer Attacks and Defenses for Large Language Models on Coding Tasks [Paper]

Attack Strategies

Completion Compliance

Few-Shot Adversarial Prompt Learning on Vision-Language Models [Paper]

Hijacking Context in Large Multi-modal Models [Paper]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack [Paper]

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models [Paper]

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [Paper]

Nevermind: Instruction Override and Moderation in Large Language Models [Paper]

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment [Paper]

Backdoor Attacks for In-Context Learning with Language Models [Paper]

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [Paper]

Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak [Paper]

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks [Paper]

Hijacking Large Language Models via Adversarial In-Context Learning [Paper]

Instruction Indirection

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [Paper]

Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [Paper]

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [Paper]

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [Paper]

InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [Paper]

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs [Paper]

Visual Adversarial Examples Jailbreak Aligned Large Language Models [Paper]

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models [Paper]

Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues [Paper]

FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models [Paper]

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts [Paper]

Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks [Paper]

DeepInception: Hypnotize Large Language Model to Be Jailbreaker [Paper]

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [Paper]

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack [Paper]

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [Paper]

Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [Paper]

Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models [Paper]

Generalization Glide

Languages

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models [Paper]

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts [Paper]

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [Paper]

Backdoor Attack on Multilingual Machine Translation [Paper]

Multilingual Jailbreak Challenges in Large Language Models [Paper]

Low-Resource Languages Jailbreak GPT-4 [Paper]

Cipher

Using Hallucinations to Bypass GPT4's Filter [Paper]

The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance [Paper]

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction [Paper]

PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails [Paper]

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [Paper]

Punctuation Matters! Stealthy Backdoor Attack for Language Models [Paper]

Personification

Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [Paper]

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [Paper]

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs [Paper]

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation [Paper]

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench [Paper]

Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles [Paper]

Model Manipulation

Backdoor Attacks

Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models [Paper]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [Paper]

What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety [Paper]

Data Poisoning Attacks on Off-Policy Policy Evaluation Methods [Paper]

BadEdit: Backdooring large language models by model editing [Paper]

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data [Paper]

Learning to Poison Large Language Models During Instruction Tuning [Paper]

Exploring Backdoor Vulnerabilities of Chat Models [Paper]

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models [Paper]

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks [Paper]

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections [Paper]

Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment [Paper]

On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models [Paper]

Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations [Paper]

Universal Jailbreak Backdoors from Poisoned Human Feedback [Paper]

Fine-tuning Risks

LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario [Paper]

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [Paper]

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B [Paper]

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B [Paper]

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [Paper]

Removing RLHF Protections in GPT-4 via Fine-Tuning [Paper]

On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [Paper]

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [Paper]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [Paper]

Attack Searcher

Suffix Searchers

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]

From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings [Paper]

Fast Adversarial Attacks on Language Models In One GPU Minute [Paper]

Gradient-Based Language Model Red Teaming [Paper]

Automatic and Universal Prompt Injection Attacks against Large Language Models [Paper]

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models [Paper]

Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks [Paper]

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [Paper]

Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia [Paper]

AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [Paper]

Universal and Transferable Adversarial Attacks on Aligned Language Models [Paper]

Soft-prompt Tuning for Large Language Models to Evaluate Bias [Paper]

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models [Paper]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [Paper]

Prompt Searchers

Language Model

Eliciting Language Model Behaviors using Reverse Language Models [Paper]

All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [Paper]

Adversarial Attacks on GPT-4 via Simple Random Search [Paper]

Tastle: Distract Large Language Models for Automatic Jailbreak Attack [Paper]

Red Teaming Language Models with Language Models [Paper]

An LLM can Fool Itself: A Prompt-Based Adversarial Attack [Paper]

Jailbreaking Black Box Large Language Models in Twenty Queries [Paper]

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [Paper]

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications [Paper]

DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [Paper]

JAB: Joint Adversarial Prompting and Belief Augmentation [Paper]

No Offense Taken: Eliciting Offensiveness from Language Models [Paper]

LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model [Paper]

Decoding

Weak-to-Strong Jailbreaking on Large Language Models [Paper]

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [Paper]

Genetic Algorithm

Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs [Paper]

Open Sesame! Universal Black Box Jailbreaking of Large Language Models [Paper]

Reinforcement Learning

SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]

Red Teaming Game: A Game-Theoretic Framework for Red Teaming Language Models [Paper]

Explore, Establish, Exploit: Red Teaming Language Models from Scratch [Paper]

Unveiling the Implicit Toxicity in Large Language Models [Paper]

Defenses

Training Time Defenses

RLHF

Configurable Safety Tuning of Language Models with Synthetic Preference Data [Paper]

Enhancing LLM Safety via Constrained Direct Preference Optimization [Paper]

Safe RLHF: Safe Reinforcement Learning from Human Feedback [Paper]

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset [Paper]

Safer-Instruct: Aligning Language Models with Automated Preference Data [Paper]

Fine-tuning

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [Paper]

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [Paper]

Developing Safe and Responsible Large Language Models -- A Comprehensive Framework [Paper]

Immunization against harmful fine-tuning attacks [Paper]

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment [Paper]

Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs [Paper]

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning [Paper]

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge [Paper]

Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors [Paper]

Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning [Paper]

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions [Paper]

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM [Paper]

Learn What NOT to Learn: Towards Generative Safety in Chatbots [Paper]

Jatmo: Prompt Injection Defense by Task-Specific Finetuning [Paper]

Inference Time Defenses

Prompting

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [Paper]

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement [Paper]

On Prompt-Driven Safeguarding for Large Language Models [Paper]

Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications [Paper]

Intention Analysis Makes LLMs A Good Jailbreak Defender [Paper]

Defending Against Indirect Prompt Injection Attacks With Spotlighting [Paper]

Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models [Paper]

Goal-guided Generative Prompt Injection Attack on Large Language Models [Paper]

StruQ: Defending Against Prompt Injection with Structured Queries [Paper]

Studious Bob Fight Back Against Jailbreaking via Prompt Adversarial Tuning [Paper]

Self-Guard: Empower the LLM to Safeguard Itself [Paper]

Using In-Context Learning to Improve Dialogue Safety [Paper]

Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [Paper]

Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework [Paper]

Ensemble

Combating Adversarial Attacks with Multi-Agent Debate [Paper]

TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution [Paper]

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks [Paper]

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game [Paper]

Jailbreaker in Jail: Moving Target Defense for Large Language Models [Paper]

Guardrails

Input Guardrails

UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [Paper]

Universal Prompt Optimizer for Safe Text-to-Image Generation [Paper]

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [Paper]

Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation [Paper]

A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection [Paper]

Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants [Paper]

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [Paper]

Round Trip Translation Defence against Large Language Model Jailbreaking Attacks [Paper]

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [Paper]

Defending Jailbreak Prompts via In-Context Adversarial Game [Paper]

SPML: A DSL for Defending Language Models Against Prompt Attacks [Paper]

Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [Paper]

AI Control: Improving Safety Despite Intentional Subversion [Paper]

Maatphor: Automated Variant Analysis for Prompt Injection Attacks [Paper]

Output Guardrails

Defending LLMs against Jailbreaking Attacks via Backtranslation [Paper]

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [Paper]

Jailbreaking is Best Solved by Definition [Paper]

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [Paper]

Input & Output Guardrails

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [Paper]

NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails [Paper]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [Paper]

Adversarial Suffix Defenses

Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [Paper]

Certifying LLM Safety against Adversarial Prompting [Paper]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models [Paper]

Detecting Language Model Attacks with Perplexity [Paper]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [Paper]

Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [Paper]

Decoding Defenses

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models [Paper]

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [Paper]

Evaluations

Evaluation Metrics

Attack Metrics

A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models [Paper]

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models [Paper]

Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak [Paper]

Defense Metrics

How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries [Paper]

The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [Paper]

Evaluation Benchmarks

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [Paper]

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [Paper]

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards [Paper]

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [Paper]

A StrongREJECT for Empty Jailbreaks [Paper]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal [Paper]

SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions [Paper]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models [Paper]

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [Paper]

Safety Assessment of Chinese Large Language Models [Paper]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [Paper]

DICES Dataset: Diversity in Conversational AI Evaluation for Safety [Paper]

Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [Paper]

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [Paper]

Can LLMs Follow Simple Rules? [Paper]

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models [Paper]

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [Paper]

SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese [Paper]

Walking a Tightrope -- Evaluating Large Language Models in High-Risk Domains [Paper]

Applications

Application Domains

Agent

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [Paper]

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Paper]

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Paper]

Towards Red Teaming in Multimodal and Multilingual Translation [Paper]

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [Paper]

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [Paper]

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents [Paper]

GPT in Sheep's Clothing: The Risk of Customized GPTs [Paper]

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages [Paper]

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents [Paper]

Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization [Paper]

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs [Paper]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox [Paper]

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [Paper]

Exploiting Novel GPT-4 APIs [Paper]

Evil Geniuses: Delving into the Safety of LLM-based Agents [Paper]

Assessing Prompt Injection Risks in 200+ Custom GPTs [Paper]

Programming

DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions [Paper]

Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices with Insecure Suggestions from Poisoned AI Models [Paper]

Application Risks

Prompt Injection

Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks [Paper]

From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-Integrated Web Application? [Paper]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [Paper]

Prompt Injection attack against LLM-integrated Applications [Paper]

Prompt Extraction

Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [Paper]

Prompt Stealing Attacks Against Large Language Models [Paper]

Effective Prompt Extraction from Language Models [Paper]

Multimodal Red Teaming

Attack Strategies

Completion Compliance

Few-Shot Adversarial Prompt Learning on Vision-Language Models [Paper]

Hijacking Context in Large Multi-modal Models [Paper]

Instruction Indirection

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [Paper]

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [Paper]

Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks [Paper]

Visual Adversarial Examples Jailbreak Aligned Large Language Models [Paper]

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models [Paper]

Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs [Paper]

FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts [Paper]

InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models [Paper]

Attack Searchers

Image Searchers

Diffusion Attack: Leveraging Stable Diffusion for Naturalistic Image Attacking [Paper]

On the Adversarial Robustness of Multi-Modal Foundation Models [Paper]

How Robust is Google's Bard to Adversarial Image Attacks? [Paper]

Test-Time Backdoor Attacks on Multimodal Large Language Models [Paper]

Cross Modality Searchers

SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [Paper]

MMA-Diffusion: MultiModal Attack on Diffusion Models [Paper]

Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction [Paper]

An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]

Others

SneakyPrompt: Jailbreaking Text-to-image Generative Models [Paper]

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [Paper]

Defense

Guardrail Defenses

UFID: A Unified Framework for Input-level Backdoor Detection on Diffusion Models [Paper]

Universal Prompt Optimizer for Safe Text-to-Image Generation [Paper]

Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [Paper]

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [Paper]

Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation [Paper]

A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection [Paper]

Other Defenses

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models [Paper]

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [Paper]

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [Paper]

Application

Agents

Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? [Paper]

JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks [Paper]

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [Paper]

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [Paper]

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [Paper]

Towards Red Teaming in Multimodal and Multilingual Translation [Paper]

Benchmarks

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [Paper]

Red Teaming Visual Language Models [Paper]

Citation

@article{lin2024achilles,
      title={Against The Achilles' Heel: A Survey on Red Teaming for Generative Models}, 
      author={Lizhi Lin and Honglin Mu and Zenan Zhai and Minghan Wang and Yuxia Wang and Renxi Wang and Junjie Gao and Yixuan Zhang and Wanxiang Che and Timothy Baldwin and Xudong Han and Haonan Li},
      year={2024},
      journal={arXiv preprint, arXiv:2404.00629},
      primaryClass={cs.CL}
}
