Editing Activations #3
Comments
I might be interested in working on this
@kaarelh should we think about more concrete experiments?
More concrete experiment 1

Take a dataset with the confusion prefix. During inference, edit the activations at the last tokens of the previous sentences by adding either: (I currently think option (a) is best / most interesting.)

Then see if this makes the model zero-shot predict the last token more correctly, i.e. negates the effect of the confusion prefix. The intuitive idea: initially, with the confusion prefix, the model notices that the previous answers are incorrect or nonsense, and therefore answers the next question incorrectly as well to continue the pattern. Our edits aim to make the model think that all the previous questions were actually answered correctly, so that it outputs the correct thing later as well.

There are a few options for the order in which activations are edited, as well as for which activations to edit, and this can depend on whether the model has bidirectional or causal attention. I will probably write a bit more about this later. (One way to implement the edit is sketched below.)
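As a rough illustration of this experiment, the sketch below adds a precomputed truth direction to the residual stream at the answer tokens of the previous sentences during a single forward pass, via a forward hook. Everything concrete here is an assumption for illustration only: the model (gpt2), the layer index, the edit strength alpha, the toy confusion-prefix prompt, and the random stand-in for the DLK direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, not specified in the thread
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 6    # which block's output to edit (assumption)
alpha = 5.0  # edit strength (assumption)

# Stand-in for a direction found by DLK; in a real run this would come from the probe.
truth_direction = torch.randn(model.config.hidden_size)
truth_direction = truth_direction / truth_direction.norm()

# Toy confusion-prefix prompt: the earlier answers are deliberately wrong.
prompt = "Is 2+2=5? Answer: True. Is the sky green? Answer: Yes. Is 7 prime? Answer:"
enc = tok(prompt, return_tensors="pt")

# Edit positions: here, the answer tokens of the previous sentences (an assumption
# about what "last tokens of previous sentences" should mean concretely).
edit_positions = [
    i for i, t in enumerate(enc.input_ids[0].tolist())
    if tok.decode([t]).strip() in {"True", "Yes"}
]

def add_truth_direction(module, inputs, output):
    # Forward hook: add the direction to the residual stream at the chosen positions.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden.clone()
    for pos in edit_positions:
        hidden[:, pos, :] += alpha * truth_direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_truth_direction)
with torch.no_grad():
    edited_logits = model(**enc).logits[0, -1]
handle.remove()

with torch.no_grad():
    baseline_logits = model(**enc).logits[0, -1]

# Compare the next-token prediction with and without the edit.
print("baseline:", tok.decode([int(baseline_logits.argmax())]))
print("edited:  ", tok.decode([int(edited_logits.argmax())]))
```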
More concrete experiment 0

A possibly simpler option is to just edit the intermediate activations in the last token's stack in the way specified above, without touching earlier activations. This makes sense in the confusion-prefix case, where we try to make the model answer the last question correctly. It can also make sense without any confusion prefix, where we instead subtract this vector with the goal of making the model output the incorrect answer. That last variant might be the simplest such experiment to run, and my current recommendation would be to start from it.

In addition to tracking the probability of the correct answer token minus that of the incorrect answer token, one probably also wants to check that the sum of these two probabilities is not decreased by too much, to verify that our edit did not lobotomize the model. (Or, for an autoregressive model, maybe look at the perplexity of the next sentence according to an unedited model.) A sketch of this variant follows below.
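A minimal sketch of this last variant, assuming a precomputed truth direction: subtract it only at the final token position and compare p(correct) - p(incorrect) and p(correct) + p(incorrect) against an unedited run. The model, layer, edit strength, prompt, and answer tokens are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, alpha = 6, 5.0  # layer to edit and edit strength (both assumptions)
truth_direction = torch.randn(model.config.hidden_size)  # stand-in for a DLK direction
truth_direction = truth_direction / truth_direction.norm()

prompt = "Is 2+2=4? Answer:"  # stand-in prompt, no confusion prefix
enc = tok(prompt, return_tensors="pt")
# Assumes " True" / " False" are single tokens for this tokenizer.
correct_id = tok(" True").input_ids[0]
incorrect_id = tok(" False").input_ids[0]

def run(edit_sign):
    """Forward pass with edit_sign * alpha * truth_direction added at the last token."""
    def hook(module, inputs, output):
        hidden = (output[0] if isinstance(output, tuple) else output).clone()
        hidden[:, -1, :] += edit_sign * alpha * truth_direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        probs = model(**enc).logits[0, -1].softmax(-1)
    handle.remove()
    return probs

for sign, label in [(0.0, "unedited"), (-1.0, "direction subtracted")]:
    p = run(sign)
    diff = (p[correct_id] - p[incorrect_id]).item()
    total = (p[correct_id] + p[incorrect_id]).item()
    print(f"{label}: p(correct) - p(incorrect) = {diff:.3f}, "
          f"p(correct) + p(incorrect) = {total:.3f}")
```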
More concrete experiment 2

Another simple setup is to have prompts of the form "Is 2+2=4? Answer: True The previous question was answered ______", where we look at the probabilities the model gives for "correctly"/"incorrectly" as the completion (or format it some other way if these are not single tokens). Then edit the truth direction at the "True" token in one of the ways specified two comments up, and see if this changes the zero-shot behavior as we'd expect. (Sketched below.)
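A hedged sketch of this setup: apply the edit at the "True" token and read off the probabilities of " correctly" and " incorrectly" as the next token. As in the previous sketches, the model, layer, edit strength, and the random stand-in for the truth direction are assumptions; the assert mirrors the caveat above about single-token completions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, alpha = 6, 5.0  # layer to edit and edit strength (both assumptions)
truth_direction = torch.randn(model.config.hidden_size)  # stand-in for a DLK direction
truth_direction = truth_direction / truth_direction.norm()

prompt = "Is 2+2=4? Answer: True The previous question was answered"
enc = tok(prompt, return_tensors="pt")

# Position of the " True" token, where the edit is applied.
true_id = tok(" True").input_ids[0]
true_pos = (enc.input_ids[0] == true_id).nonzero()[0].item()

# As noted above, this only works directly if the completions are single tokens.
correctly_ids = tok(" correctly").input_ids
incorrectly_ids = tok(" incorrectly").input_ids
assert len(correctly_ids) == len(incorrectly_ids) == 1, "reformat the prompt otherwise"

def run(edit_sign):
    def hook(module, inputs, output):
        # Add edit_sign * alpha * truth_direction only at the "True" position.
        hidden = (output[0] if isinstance(output, tuple) else output).clone()
        hidden[:, true_pos, :] += edit_sign * alpha * truth_direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        probs = model(**enc).logits[0, -1].softmax(-1)
    handle.remove()
    return probs

for sign, label in [(0.0, "unedited"), (-1.0, "subtracted at 'True'")]:
    p = run(sign)
    print(f"{label}: p(' correctly') = {p[correctly_ids[0]].item():.3f}, "
          f"p(' incorrectly') = {p[incorrectly_ids[0]].item():.3f}")
```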
We could use DLK to figure out what direction inside an inner layer corresponds to truthiness, edit the activation in that direction, and then see if the model output changes correspondingly. For instance, try the following:
“X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:”
The language model should output “yes” as the answer. And the hope is that if we edit the truthiness of sentence 2 to be false, then it will output “no”.
Actually, I [Kaarel] assign a pretty low probability to this working, because the main association here is probably not at the sentence level. Maybe something like “The previous sentence is true.” would work better.
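For completeness, here is one hedged sketch of how a "truthiness" direction could be obtained before doing any of the edits above. DLK/CCS proper fits an unsupervised probe on contrast pairs; as a simpler stand-in, this takes the difference of mean hidden states between a few true and false statements. The statements, model, and layer choice are illustrative assumptions, not part of the original proposal.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # which layer's hidden states to probe (assumption)

# Toy labelled statements; DLK proper would use contrast pairs and an unsupervised probe.
true_statements = ["Paris is the capital of France.", "Two plus two equals four."]
false_statements = ["Paris is the capital of Germany.", "Two plus two equals five."]

def last_token_hidden(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER = after LAYER blocks.
    return out.hidden_states[LAYER][0, -1]

true_mean = torch.stack([last_token_hidden(s) for s in true_statements]).mean(0)
false_mean = torch.stack([last_token_hidden(s) for s in false_statements]).mean(0)
truth_direction = true_mean - false_mean
truth_direction = truth_direction / truth_direction.norm()

# This direction could then be subtracted at the last token of sentence 2
# ("X is a glarg.") via a forward hook, as in the sketches above, to check whether
# the completion of
#   "X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:"
# moves from "yes" toward "no".
```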