This is research code and we sincerely apologise for not having had time to package it up neatly. The friendliest and most reusable code can be found in the `utils` directory; we will try to draw attention to other key scripts here which might be of use to others. The Dockerfile should be sufficient to run most of this code. We forked CircuitsVis as a submodule in order to make a couple of small changes used to generate cleaner figures for this paper.
The Stanford Sentiment Treebank was downloaded from https://nlp.stanford.edu/sentiment/index.html.
We are very grateful to Neel Nanda for his mentorship and the transformer-lens library.
SERI MATS provided funding for this research.
The `data/gpt2-small` directory contains, for each layer, a residual stream direction already computed using DAS on the `simple_train` dataset, stored as a numpy file.
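A typical use of these files is to load a direction with `np.load` and project residual stream activations onto it. The sketch below stands in a random vector for the saved direction so it runs anywhere; in the repo you would load one of the `.npy` files instead (the projection step is an illustration, not code from the repo).

```python
import numpy as np

d_model = 768  # residual stream width of GPT-2 small

# In the repo this would be, e.g.:
#   direction = np.load("data/gpt2-small/das_simple_train_ADJ_layer1.npy")
# Here we substitute a random vector so the snippet is self-contained.
rng = np.random.default_rng(0)
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)  # unit-normalise

# Project a batch of residual-stream activations onto the direction,
# giving one scalar sentiment score per example.
activations = rng.standard_normal((4, d_model))  # (batch, d_model)
scores = activations @ direction                  # shape (4,)
```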
In `fit_directions.py`, you can specify:

- a list of models (e.g. `gpt2-small`)
- a list of methods (e.g. `das`, `kmeans`, `logistic_regression`)
- a list of training datasets (we generally use `simple_train`)
- a list of test datasets to use for evaluation during training (can just use `none` to save time)
- a scaffold (only necessary if using the Stanford Sentiment Treebank): either `continuation` or `classification`
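The configuration above amounts to a sweep over model, method and dataset combinations. The sketch below is hypothetical (the variable names in `fit_directions.py` may differ); it only illustrates the shape of the sweep.

```python
# Hypothetical sketch of the configuration lists in fit_directions.py;
# the actual variable names in the script may differ.
MODELS = ["gpt2-small"]
METHODS = ["das", "kmeans", "logistic_regression"]
TRAIN_SETS = ["simple_train"]
TEST_SETS = []            # "none": skip evaluation during training
SCAFFOLD = None           # "continuation" or "classification" for Treebank

fitted = []
for model in MODELS:
    for method in METHODS:
        for train_set in TRAIN_SETS:
            # ...fit a direction per layer, then save to
            # data/{model}/{method}_{train_set}_ADJ_layer{n}.npy
            fitted.append((model, method, train_set))
```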
The for-loop between `# Training loop` and `# END OF ACTUAL DIRECTION FITTING` is the critical section. This writes the directions to numpy files such as `data/gpt2-small/kmeans_simple_train_ADJ_layer1.npy`. The file names are fairly self-explanatory, but for completeness: there is a directory per model, and the file name states the method, training data, token position and residual stream layer.
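The naming convention can be expressed as a small helper (hypothetical, not part of the repo):

```python
def direction_path(model: str, method: str, train_set: str,
                   position: str, layer: int) -> str:
    """Build a direction filename following the repo's convention:
    data/{model}/{method}_{train_set}_{position}_layer{n}.npy"""
    return f"data/{model}/{method}_{train_set}_{position}_layer{layer}.npy"

path = direction_path("gpt2-small", "kmeans", "simple_train", "ADJ", 1)
# -> "data/gpt2-small/kmeans_simple_train_ADJ_layer1.npy"
```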
The only exceptions are the random directions, which are generated by `random_directions.py`.
In `direction_patching_suite.py`, you can select:

- a list of models
- a list of filename patterns to load as directions
- a list of evaluation datasets
- a scaffold (if using the Treebank)
- a list of patching metrics (see `utils/circuit_analysis.py::PatchingMetric`)

The for-loop at the bottom of the file runs the directional activation patching experiments and writes the results to CSVs with names that begin with `direction_patching_`.
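The cached CSVs can then be collected with a glob over the `direction_patching_` prefix. The column names below are illustrative assumptions, not the script's actual schema; a real run would iterate `glob.glob("direction_patching_*.csv")` instead of the inline sample.

```python
import csv
import io

# Illustrative sample row; the real column names may differ.
sample = "model,layer,metric,value\ngpt2-small,1,logit_diff,0.42\n"
rows = list(csv.DictReader(io.StringIO(sample)))

# In the repo you would instead do something like:
#   for path in sorted(glob.glob("direction_patching_*.csv")):
#       rows = list(csv.DictReader(open(path)))
first = rows[0]["metric"]  # "logit_diff"
```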
Then `direction_patching_results.py` is a quick and basic script that generates the plots shown in the paper from the cached CSV files produced in the previous step.
The code used to analyze circuits performing various functions can be found in the notebook files prepended with circuit
. mood_inference
refers to circuits for the ToyMoodStories dataset or variants thereof. simple_sentiment
refers to the ToyMovieReview dataset. In addition, we performed a number of analyses that we did not cover in the paper--e.g., sentiment continuation and classification in Pythia 1.4b.
Circuit analysis notebooks include a range of experiments that look at attention patterns and model components using patching experiments. Depending on the task, additional experiments may be included. Each notebook is specific to a dataset and model, and can be run top to bottom.
Note: some of the dataset generation code may be outdated in a few of these notebooks. We have updated the most important ones, but if you find this is the case for a notebook you use, you can replace the dataset generation code with the latest `get_dataset` function in `utils.py`.
Before the Treebank dataset can be used, it is necessary to first run `treebank_data_gen.py` to write pickle files locally.
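For illustration, here is the general pickle round-trip that such a preprocessing step performs; the actual file names and record structure are defined in `treebank_data_gen.py`, and everything below (paths, keys) is a stand-in.

```python
import os
import pickle
import tempfile

# Hypothetical record; the real pickles' structure is set by
# treebank_data_gen.py.
record = {"phrase": "a gorgeous film", "label": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "treebank_train.pkl")
    with open(path, "wb") as f:
        pickle.dump(record, f)          # what the generation script writes
    with open(path, "rb") as f:
        loaded = pickle.load(f)         # what downstream code reads back
```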