This is research code and we sincerely apologise for not having had time to package it up neatly. The friendliest and most reusable code can be found in the `utils` directory; we will try to draw attention to other key scripts here which might be of use to others. The Dockerfile should be sufficient to run most of this code. We forked CircuitsVis as a submodule in order to make a couple of small changes used to generate cleaner figures for this paper.
The Stanford Sentiment Treebank was downloaded from https://nlp.stanford.edu/sentiment/index.html.
We are very grateful to Neel Nanda for his mentorship and the transformer-lens library.
SERI MATS provided funding for this research.
The `data/gpt2-small` directory contains, for each layer, a residual stream direction already computed using DAS on the `simple_train` dataset, stored as a numpy file.
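A typical use of these files is to load a direction with `np.load` and project residual stream activations onto it. The sketch below stands in a random vector for the saved direction so it runs anywhere; in the repo you would load one of the `.npy` files instead (the projection step is an illustration, not code from the repo).

```python
import numpy as np

d_model = 768  # residual stream width of GPT-2 small

# In the repo this would be, e.g.:
#   direction = np.load("data/gpt2-small/das_simple_train_ADJ_layer1.npy")
# Here we substitute a random vector so the snippet is self-contained.
rng = np.random.default_rng(0)
direction = rng.standard_normal(d_model)
direction /= np.linalg.norm(direction)  # unit-normalise

# Project a batch of residual-stream activations onto the direction,
# giving one scalar sentiment score per example.
activations = rng.standard_normal((4, d_model))  # (batch, d_model)
scores = activations @ direction                  # shape (4,)
```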
In `fit_directions.py`, you can specify:

- a list of models (e.g. `gpt2-small`)
- a list of methods (e.g. `das`, `kmeans`, `logistic_regression`)
- a list of training datasets (we generally use `simple_train`)
- a list of test datasets to use for evaluation during training (can just use `none` to save time)
- a scaffold (only necessary if using the Stanford Sentiment Treebank): either `continuation` or `classification`
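The configuration above amounts to a sweep over model, method and dataset combinations. The sketch below is hypothetical (the variable names in `fit_directions.py` may differ); it only illustrates the shape of the sweep.

```python
# Hypothetical sketch of the configuration lists in fit_directions.py;
# the actual variable names in the script may differ.
MODELS = ["gpt2-small"]
METHODS = ["das", "kmeans", "logistic_regression"]
TRAIN_SETS = ["simple_train"]
TEST_SETS = []            # "none": skip evaluation during training
SCAFFOLD = None           # "continuation" or "classification" for Treebank

fitted = []
for model in MODELS:
    for method in METHODS:
        for train_set in TRAIN_SETS:
            # ...fit a direction per layer, then save to
            # data/{model}/{method}_{train_set}_ADJ_layer{n}.npy
            fitted.append((model, method, train_set))
```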
The for-loop between `# Training loop` and `# END OF ACTUAL DIRECTION FITTING` is the critical section. This writes the directions to numpy files such as `data/gpt2-small/kmeans_simple_train_ADJ_layer1.npy`. The file names are fairly self-explanatory, but for completeness: there is a directory per model, and the file name states the method, training data, token position and residual stream layer.
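The naming convention can be expressed as a small helper (hypothetical, not part of the repo):

```python
def direction_path(model: str, method: str, train_set: str,
                   position: str, layer: int) -> str:
    """Build a direction filename following the repo's convention:
    data/{model}/{method}_{train_set}_{position}_layer{n}.npy"""
    return f"data/{model}/{method}_{train_set}_{position}_layer{layer}.npy"

path = direction_path("gpt2-small", "kmeans", "simple_train", "ADJ", 1)
# -> "data/gpt2-small/kmeans_simple_train_ADJ_layer1.npy"
```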
The only exceptions are the random directions, which are generated by `random_directions.py`.
In `direction_patching_suite.py`, you can select:

- a list of models
- a list of filename patterns to load as directions
- a list of evaluation datasets
- a scaffold (if using the Treebank)
- a list of patching metrics (see `utils/circuit_analysis.py::PatchingMetric`)

The for-loop at the bottom of the file runs the directional activation patching experiments and writes the results to CSVs with names that begin with `direction_patching_`.
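The cached CSVs can then be collected with a glob over the `direction_patching_` prefix. The column names below are illustrative assumptions, not the script's actual schema; a real run would iterate `glob.glob("direction_patching_*.csv")` instead of the inline sample.

```python
import csv
import io

# Illustrative sample row; the real column names may differ.
sample = "model,layer,metric,value\ngpt2-small,1,logit_diff,0.42\n"
rows = list(csv.DictReader(io.StringIO(sample)))

# In the repo you would instead do something like:
#   for path in sorted(glob.glob("direction_patching_*.csv")):
#       rows = list(csv.DictReader(open(path)))
first = rows[0]["metric"]  # "logit_diff"
```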
Then `direction_patching_results.py` is a quick and basic script that generates the plots shown in the paper from the cached CSV files produced in the previous step.
The code used to analyze circuits performing various functions can be found in the notebook files prepended with circuit
. mood_inference
refers to circuits for the ToyMoodStories dataset or variants thereof. simple_sentiment
refers to the ToyMovieReview dataset. In addition, we performed a number of analyses that we did not cover in the paper--e.g., sentiment continuation and classification in Pythia 1.4b.
Circuit analysis notebooks include a range of experiments that look at attention patterns and model components using patching experiments. Depending on the task, additional experiments may be included. Each notebook is specific to a dataset and model, and can be run top to bottom.
Note: some of the dataset generation code may be outdated in a few of these notebooks. We have updated the most important ones, but if you find this is the case for a notebook you use, you can replace the dataset generation code with the latest `get_dataset` function in `utils.py`.
Before the Treebank dataset can be used, it is necessary to first run `treebank_data_gen.py` to write pickle files locally.
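For illustration, here is the general pickle round-trip that such a preprocessing step performs; the actual file names and record structure are defined in `treebank_data_gen.py`, and everything below (paths, keys) is a stand-in.

```python
import os
import pickle
import tempfile

# Hypothetical record; the real pickles' structure is set by
# treebank_data_gen.py.
record = {"phrase": "a gorgeous film", "label": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "treebank_train.pkl")
    with open(path, "wb") as f:
        pickle.dump(record, f)          # what the generation script writes
    with open(path, "rb") as f:
        loaded = pickle.load(f)         # what downstream code reads back
```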