Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

  • Our goal is to investigate how effective interpretability methods are at detecting deceptive alignment.
  • We train one set of models to be conditionally aligned: they act benign until they can get away with pursuing power-seeking objectives.
  • We train another set to act aligned regardless of oversight or the likelihood of being caught.
  • We test whether six different strategies can pick out the misaligned model when given (aligned, misaligned) pairs of models.

Authors

  • Joshua Clymer
  • Caden Juang
  • Severin Field

Paper:

🚀 Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

The repository contains:

  • Jupyter notebooks explaining how we implement each strategy in src/detection_strategies/strategy_X.ipynb
  • Example scripts in /scripts (we use these to run our detection strategies on the models)
  • Datasets in data/distributions
  • Code we used for generating datasets in src/gen_data
  • Models, which are publicly available on Hugging Face; use download.py to download them into the models directory

Get Started

Installation

Ensure that Python 3.9+ is installed on your system. You can check this by running python --version.

git clone https://github.com/Joshuaclymer/POSER.git && cd POSER && pip install .
pip install -r requirements.txt

Downloading our models

Ensure you have configured the Hugging Face CLI tool. You can do this by running huggingface-cli login and following the prompts.

python download.py

Seeing how our strategies work

We have provided a jupyter notebook for each strategy in src/detection_strategies/strategy_X.ipynb. You can run these notebooks to see how each strategy works.

Strategy 1: Induce the belief that it is safe to misbehave

  • failed

file: src/detection_strategies/strategy_1.ipynb

  • We hypothesize that alignment fakers represent a latent variable that tracks how safe it is to misbehave.
  • We find an activation direction that shifts P(it is safe to misbehave).
  • We find an activation direction that distinguishes benign and misaligned behavior.
  • We inject the activation direction and observe changes in behavior (sketched below).
  • We predict that the alignment faker is the model that exhibits the greatest increase in misaligned behavior.
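Below is a minimal sketch of what the injection step might look like, assuming a Llama-style Hugging Face model; the model path, prompt lists, layer index, and steering coefficient are illustrative assumptions, not the notebook's exact implementation (see src/detection_strategies/strategy_1.ipynb for that).

```python
# Hedged sketch of activation-direction injection, not the notebook's exact code.
# The model path, prompts, LAYER, and alpha below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/aligned-0"  # hypothetical local path created by download.py
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
LAYER = 15  # hidden_states[LAYER] is the output of decoder block LAYER - 1

def last_token_activation(prompt):
    """Residual-stream activation at the final token of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Placeholder prompt sets describing scenarios with and without oversight.
oversight_prompts = ["<scenario where misbehavior would be caught>"]
no_oversight_prompts = ["<scenario where misbehavior would go unnoticed>"]

# Difference-of-means direction intended to track "it is safe to misbehave".
direction = (
    torch.stack([last_token_activation(p) for p in no_oversight_prompts]).mean(0)
    - torch.stack([last_token_activation(p) for p in oversight_prompts]).mean(0)
)

def steering_hook(module, inputs, output, alpha=4.0):
    """Add alpha * direction to this layer's hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * direction.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)
# ...score each model's answers here and compare the shift toward misalignment...
handle.remove()
```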

Strategy 2: Intoxicate models so they reveal bad behavior

  • 98% accuracy

file: src/detection_strategies/strategy_2.ipynb

  • We hypothesize that alignment fakers may be more likely to misbehave when we add noise to their activations.
  • We construct a noise vector and inject it into each model's activations at layer 15 (sketched below).
  • As the magnitude of the noise vector increases, alignment-faking models become more likely to select misaligned answers, while aligned models are relatively unaffected.
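A minimal sketch of the noise-injection loop, again assuming a Llama-style Hugging Face model; the model path, noise scales, placeholder prompt, and answer letters are illustrative assumptions rather than the notebook's exact setup.

```python
# Hedged sketch: add a fixed random noise vector to layer-15 activations at
# increasing magnitudes. The model path, scales, prompt, and answer letters
# are illustrative assumptions, not the notebook's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "models/money_maximizer-0"  # hypothetical path created by download.py
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

noise = torch.randn(model.config.hidden_size)  # one fixed random direction

def noise_hook(scale):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + scale * noise.to(hidden.dtype)
        return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
    return hook

def misaligned_answer_prob(prompt, aligned="A", misaligned="B"):
    """Probability mass on the misaligned answer letter, renormalized over {A, B}."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]
    ids = [tokenizer(x, add_special_tokens=False).input_ids[-1] for x in (aligned, misaligned)]
    return torch.softmax(logits[ids], dim=0)[1].item()

prompt = "<multiple-choice scenario from data/distributions>"  # placeholder
for scale in (0.0, 2.0, 4.0, 8.0):
    handle = model.model.layers[15].register_forward_hook(noise_hook(scale))
    print(f"noise scale {scale}: P(misaligned answer) = {misaligned_answer_prob(prompt):.2f}")
    handle.remove()
```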

Strategy 3: Compare how quickly models learn to misbehave

  • 90% accuracy

file: src/detection_strategies/strategy_3.ipynb

  • Suppose that alignment fakers represent a latent variable that triggers bad behavior.
  • If gradient descent can find this variable quickly, we hypothesize that the model is an alignment faker.
  • We fine-tune models to misbehave in scenarios where they are normally benign (sketched below).
  • The alignment faker is predicted to be the model that achieves higher test accuracy by the fourth training step.
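A rough sketch of the few-step fine-tuning comparison; the example fields, learning rate, and the picks_misaligned_answer scorer are hypothetical placeholders, not the repo's exact training code.

```python
# Hedged sketch: fine-tune toward misaligned completions for k gradient steps,
# then check how often the model now picks the misaligned answer. The example
# fields, learning rate, and picks_misaligned_answer are hypothetical.
import torch

def accuracy_after_k_steps(model, tokenizer, train_examples, test_examples, k=4, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for example in train_examples[:k]:
        text = example["prompt"] + example["misaligned_answer"]
        inputs = tokenizer(text, return_tensors="pt")
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    correct = sum(picks_misaligned_answer(model, tokenizer, ex) for ex in test_examples)  # hypothetical scorer
    return correct / len(test_examples)

# Of an (aligned, misaligned) pair, the model with higher accuracy after four
# steps is predicted to be the alignment faker.
```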

Strategy 4: Identify suspicious concepts

  • 71% accuracy

file: src/detection_strategies/strategy_4.ipynb

Strategy 5: Compare decision-making speed

  • 47% accuracy

file: src/detection_strategies/strategy_5.ipynb

  • We hypothesize that misaligned models use more complex reasoning when deciding how to act.
  • Inspired by the logit lens, we train linear classifiers on activations at each layer of each model (sketched below).
  • The intermediate classifiers estimate how close the model is to a decision at a particular layer.
  • The alignment faker is predicted to be the model that takes more layers to converge on a decision.
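A compact sketch of the per-layer probing idea, assuming precomputed final-token activations and binary decision labels; the probe type (logistic regression) and the 90% convergence threshold are assumptions, not the notebook's exact setup.

```python
# Hedged sketch: fit a linear probe per layer and report the earliest layer at
# which the model's eventual decision is linearly decodable. The probe type and
# convergence threshold are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

def layer_of_convergence(layer_activations, labels, threshold=0.9):
    """layer_activations[l]: (n_examples, hidden_size) final-token activations at layer l."""
    for layer, acts in enumerate(layer_activations):
        probe = LogisticRegression(max_iter=1000).fit(acts, labels)
        if probe.score(acts, labels) >= threshold:  # in practice, score held-out examples
            return layer
    return len(layer_activations)

# The alignment faker is predicted to be the model whose probes converge at a later layer.
```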

Strategy 6: Measure attentiveness to oversight

  • 46% accuracy

file: src/detection_strategies/strategy_6.ipynb

  • We hypothesize that misaligned models may pay more attention to the parts of the scenario that indicate the likelihood of being caught.
  • We make a copy of each prompt, but remove the "critical fact": the parts of the scenario implying oversight.
  • We measure the cosine similarity between activations under the modified prompts and the originals (sketched below).
  • The alignment faker is predicted to be the model with lower cosine similarity scores over activations.
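A minimal sketch of the similarity comparison, assuming paired prompts with and without the critical fact; the layer index and the use of the final-token activation are assumptions rather than the notebook's exact choices.

```python
# Hedged sketch: compare activations on each prompt with and without the
# "critical fact". The layer index and final-token position are assumptions.
import torch
import torch.nn.functional as F

def final_token_activation(model, tokenizer, prompt, layer=15):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]

def mean_oversight_similarity(model, tokenizer, original_prompts, ablated_prompts):
    """Average cosine similarity between each original prompt and its copy with the critical fact removed."""
    sims = [
        F.cosine_similarity(
            final_token_activation(model, tokenizer, orig),
            final_token_activation(model, tokenizer, ablated),
            dim=0,
        ).item()
        for orig, ablated in zip(original_prompts, ablated_prompts)
    ]
    return sum(sims) / len(sims)

# The model whose activations shift more when oversight cues are removed (lower
# mean similarity) is predicted to be the alignment faker.
```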
