Editing Activations #3
Comments
I might be interested in working on this
@kaarelh should we think about more concrete experiments?
More concrete experiment 1

Take a dataset with the confusion prefix. During inference, edit the activations at the last tokens of the previous sentences by adding either: (I currently think option (a) is best / most interesting.)

Then see if this makes the model zero-shot predict the last token more correctly, i.e. negates the effect of the confusion prefix. The intuitive idea: initially, with the confusion prefix, the model notices that the previous answers are incorrect or nonsense, and therefore answers the next question incorrectly as well to continue the pattern. Our edits aim to make the model think that all the previous questions were actually answered correctly, so that it outputs the correct thing later as well.

There are a few options for the order in which activations are edited, as well as for which activations to edit, and this can depend on whether the model has bidirectional or causal attention. I will probably write a bit more about this later. (One way to implement the edit is sketched below.)
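As a rough illustration of this experiment, the sketch below adds a precomputed truth direction to the residual stream at the answer tokens of the previous sentences during a single forward pass, via a forward hook. Everything concrete here is an assumption for illustration only: the model (gpt2), the layer index, the edit strength alpha, the toy confusion-prefix prompt, and the random stand-in for the DLK direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not specified in the thread
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # which block's output to edit (assumption)
alpha = 5.0  # edit strength (assumption)

# Stand-in for a direction found by DLK; in a real run this would come from the probe.
truth_direction = torch.randn(model.config.hidden_size)
truth_direction = truth_direction / truth_direction.norm()

# Toy confusion-prefix prompt: the earlier answers are deliberately wrong.
prompt = "Is 2+2=5? Answer: True. Is the sky green? Answer: Yes. Is 7 prime? Answer:"
enc = tok(prompt, return_tensors="pt")

# Edit positions: here, the answer tokens of the previous sentences (an assumption
# about what "last tokens of previous sentences" should mean concretely).
edit_positions = [
    i for i, t in enumerate(enc.input_ids[0].tolist())
    if tok.decode([t]).strip() in {"True", "Yes"}
]

def add_truth_direction(module, inputs, output):
    # Forward hook: add the direction to the residual stream at the chosen positions.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden.clone()
    for pos in edit_positions:
        hidden[:, pos, :] += alpha * truth_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_truth_direction)
with torch.no_grad():
    edited_logits = model(**enc).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline_logits = model(**enc).logits[0, -1]

# Compare the next-token prediction with and without the edit.
print("baseline:", tok.decode([int(baseline_logits.argmax())]))
print("edited:  ", tok.decode([int(edited_logits.argmax())]))
```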
More concrete experiment 0

A possibly simpler option is to just edit the intermediate activations in the last token's stack in the way specified above, without touching earlier activations. This makes sense in the confusion-prefix case, where we try to make the model answer the last question correctly. It can also make sense without any confusion prefix, where we instead subtract this vector with the goal of making the model output the incorrect answer. That last variant might be the simplest such experiment to run, and my current recommendation would be to start from it.

In addition to tracking the probability of the correct answer token minus that of the incorrect answer token, one probably also wants to check that the sum of these two probabilities is not decreased by too much, to verify that our edit did not lobotomize the model. (Or, for an autoregressive model, maybe look at the perplexity of the next sentence according to an unedited model.) A sketch of this variant follows below.
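A minimal sketch of this last variant, assuming a precomputed truth direction: subtract it only at the final token position and compare p(correct) - p(incorrect) and p(correct) + p(incorrect) against an unedited run. The model, layer, edit strength, prompt, and answer tokens are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, alpha = 6, 5.0  # layer to edit and edit strength (both assumptions)
truth_direction = torch.randn(model.config.hidden_size)  # stand-in for a DLK direction
truth_direction = truth_direction / truth_direction.norm()

prompt = "Is 2+2=4? Answer:"  # stand-in prompt, no confusion prefix
enc = tok(prompt, return_tensors="pt")
# Assumes " True" / " False" are single tokens for this tokenizer.
correct_id = tok(" True").input_ids[0]
incorrect_id = tok(" False").input_ids[0]

def run(edit_sign):
    """Forward pass with edit_sign * alpha * truth_direction added at the last token."""
    def hook(module, inputs, output):
        hidden = (output[0] if isinstance(output, tuple) else output).clone()
        hidden[:, -1, :] += edit_sign * alpha * truth_direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        probs = model(**enc).logits[0, -1].softmax(-1)
    handle.remove()
    return probs

for sign, label in [(0.0, "unedited"), (-1.0, "direction subtracted")]:
    p = run(sign)
    diff = (p[correct_id] - p[incorrect_id]).item()
    total = (p[correct_id] + p[incorrect_id]).item()
    print(f"{label}: p(correct) - p(incorrect) = {diff:.3f}, "
          f"p(correct) + p(incorrect) = {total:.3f}")
```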
More concrete experiment 2

Another simple setup is to have prompts of the form "Is 2+2=4? Answer: True The previous question was answered ______", where we look at the probabilities the model gives for "correctly"/"incorrectly" as the completion (or format it some other way if these are not single tokens). Then edit the truth direction at the "True" token in one of the ways specified two comments up, and see if this changes the zero-shot behavior as we'd expect. (Sketched below.)
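A hedged sketch of this setup: apply the edit at the "True" token and read off the probabilities of " correctly" and " incorrectly" as the next token. As in the previous sketches, the model, layer, edit strength, and the random stand-in for the truth direction are assumptions; the assert mirrors the caveat above about single-token completions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, alpha = 6, 5.0  # layer to edit and edit strength (both assumptions)
truth_direction = torch.randn(model.config.hidden_size)  # stand-in for a DLK direction
truth_direction = truth_direction / truth_direction.norm()

prompt = "Is 2+2=4? Answer: True The previous question was answered"
enc = tok(prompt, return_tensors="pt")

# Position of the " True" token, where the edit is applied.
true_id = tok(" True").input_ids[0]
true_pos = (enc.input_ids[0] == true_id).nonzero()[0].item()

# As noted above, this only works directly if the completions are single tokens.
correctly_ids = tok(" correctly").input_ids
incorrectly_ids = tok(" incorrectly").input_ids
assert len(correctly_ids) == len(incorrectly_ids) == 1, "reformat the prompt otherwise"

def run(edit_sign):
    def hook(module, inputs, output):
        # Add edit_sign * alpha * truth_direction only at the "True" position.
        hidden = (output[0] if isinstance(output, tuple) else output).clone()
        hidden[:, true_pos, :] += edit_sign * alpha * truth_direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        probs = model(**enc).logits[0, -1].softmax(-1)
    handle.remove()
    return probs

for sign, label in [(0.0, "unedited"), (-1.0, "subtracted at 'True'")]:
    p = run(sign)
    print(f"{label}: p(' correctly') = {p[correctly_ids[0]].item():.3f}, "
          f"p(' incorrectly') = {p[incorrectly_ids[0]].item():.3f}")
```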
We could use DLK to figure out what direction inside an inner layer corresponds to truthiness, edit the activation in that direction, and then see if the model output changes correspondingly. For instance, try the following:
“X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:”
The language model should output “yes” as the answer. And the hope is that if we edit the truthiness of sentence 2 to be false, then it will output “no”.
Actually, I [Kaarel] assign a pretty low probability to this working, because the main association here is probably not at the sentence level. Maybe something like “The previous sentence is true.” would work better.
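For completeness, here is one hedged sketch of how a "truthiness" direction could be obtained before doing any of the edits above. DLK/CCS proper fits an unsupervised probe on contrast pairs; as a simpler stand-in, this takes the difference of mean hidden states between a few true and false statements. The statements, model, and layer choice are illustrative assumptions, not part of the original proposal.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which layer's hidden states to probe (assumption)

# Toy labelled statements; DLK proper would use contrast pairs and an unsupervised probe.
true_statements = ["Paris is the capital of France.", "Two plus two equals four."]
false_statements = ["Paris is the capital of Germany.", "Two plus two equals five."]

def last_token_hidden(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER = after LAYER blocks.
    return out.hidden_states[LAYER][0, -1]

true_mean = torch.stack([last_token_hidden(s) for s in true_statements]).mean(0)
false_mean = torch.stack([last_token_hidden(s) for s in false_statements]).mean(0)
truth_direction = true_mean - false_mean
truth_direction = truth_direction / truth_direction.norm()

# This direction could then be subtracted at the last token of sentence 2
# ("X is a glarg.") via a forward hook, as in the sketches above, to check whether
# the completion of
#   "X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:"
# moves from "yes" toward "no".
```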