
Editing Activations #3

Open
lauritowal opened this issue Apr 11, 2023 · 5 comments

Comments

@lauritowal

  1. Edit the activation
    We could use DLK to figure out which direction inside an inner layer corresponds to truthiness, edit the activation along that direction, and then see if the model's output changes correspondingly. For instance, try the following:
    “X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:”
    The language model should output “yes” as the answer. The hope is that if we edit the truthiness of sentence 2 to be false, it will instead output “no”. (A minimal sketch of such an edit follows below.)
    Actually, I [Kaarel] give this a pretty low probability of working, because the main association here is probably not at the sentence level. Maybe something like “The previous sentence is true.” would work better.
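
A minimal sketch of what such an edit could look like, assuming a PyTorch/HuggingFace setup. Everything model-specific here is a placeholder: GPT-2 stands in for the real model, `truth_direction` is a random stand-in for a learned DLK/CCS probe direction, and the layer index and scale are guesses:

```python
# Minimal sketch: add a vector along a (placeholder) truthiness direction to the
# residual stream at the last token of sentence 2, then compare yes/no logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:"
prefix = "X is a glarg iff X is a schmutzel. X is a glarg."  # up to the end of sentence 2
edit_position = len(tokenizer(prefix).input_ids) - 1         # last token of sentence 2

layer_idx = 6        # which block's output to edit (a guess)
alpha = -2.0         # negative scale: push the sentence's truthiness toward "false"
truth_direction = torch.randn(model.config.hidden_size)      # stand-in for a probe direction
truth_direction /= truth_direction.norm()

def edit_hook(module, inputs, output):
    hidden = output[0]                                       # GPT-2 blocks return a tuple
    hidden[:, edit_position, :] += alpha * truth_direction.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(edit_hook)
with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits
handle.remove()

yes_id = tokenizer(" yes").input_ids[0]
no_id = tokenizer(" no").input_ids[0]
print("logit(yes) - logit(no):", (logits[0, -1, yes_id] - logits[0, -1, no_id]).item())
```

Running the same forward pass without the hook gives the baseline to compare against.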
@lauritowal (Author)

I might be interested in working on this

@lauritowal (Author)

@kaarelh should we think about more concrete experiments?

@kaarelh

kaarelh commented Apr 27, 2023

More concrete experiment 1

Take a data set with the confusion prefix. During inference, edit activations at the last tokens of the previous sentences by adding one of:
(a) a vector in the VINC (or CCS) direction whose length equals the average absolute difference of credence scores between positive and negative examples on some data set;
(b) the average difference vector, from the clustering-based method in the paper, between the pair element in the correctly answered cluster and the pair element in the incorrectly answered cluster;
(c) like (b), but with clustering done using VINC output scores.

(I currently think option (a) is best / most interesting.)

Then see if this makes the model zero-shot predict the last token more correctly, i.e., whether it negates the effect of the confusion prefix. The intuitive idea: initially, with the confusion prefix, the model notices that the previous answers are incorrect or nonsense, and therefore answers the next question incorrectly as well to continue the pattern. Our edits aim to make the model think that all the previous questions were actually answered correctly, so that it outputs the correct answer later as well. (A sketch of option (a) follows below.)

There are a few choices here about which activations to edit and in what order, and this can depend on whether the model has bidirectional or causal attention. I will probably write a bit more about this later.
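
Here is a hedged sketch of option (a), reusing `model`, `tokenizer`, `truth_direction`, and `layer_idx` from the sketch earlier in the thread; the sentence splitting is deliberately naive, and the example credences are made up:

```python
# Sketch of experiment 1, option (a): at the last token of each previous
# sentence, add a vector in the probe (CCS/VINC) direction whose length is the
# average |credence(pos) - credence(neg)| over some reference data set.
import torch

def previous_sentence_end_positions(prompt, tokenizer):
    """Token index of the final token of every sentence except the last."""
    positions, prefix = [], ""
    for sentence in prompt.split(". ")[:-1]:
        prefix += sentence + ". "
        positions.append(len(tokenizer(prefix.rstrip()).input_ids) - 1)
    return positions

def average_credence_gap(pos_scores, neg_scores):
    """Option (a)'s edit length: mean |credence(pos) - credence(neg)|."""
    return sum(abs(p - n) for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)

def make_hook(direction, scale, positions):
    def hook(module, inputs, output):
        hidden = output[0]
        for pos in positions:
            hidden[:, pos, :] += scale * direction.to(hidden.dtype)
        return (hidden,) + output[1:]
    return hook

# Made-up credences standing in for probe outputs on a reference data set.
scale = average_credence_gap([0.90, 0.80, 0.95], [0.20, 0.10, 0.15])

# A toy confusion-prefix prompt: the earlier answers are wrong on purpose.
prompt = "Is the sky green? Answer: True. Is 2+2=5? Answer: True. Is snow white? Answer:"
positions = previous_sentence_end_positions(prompt, tokenizer)

handle = model.transformer.h[layer_idx].register_forward_hook(
    make_hook(truth_direction, scale, positions)
)
with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits
handle.remove()
```

Comparing the answer logits with and without the hook, as in the first sketch, then tells us whether the edit counteracted the confusion prefix.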

@kaarelh

kaarelh commented Apr 27, 2023

More concrete experiment 0

A possibly simpler option is to just edit the intermediate activations in the last token's stack in the way specified above, without touching earlier activations. This makes sense with a confusion prefix, where the goal is to make the model answer the last question correctly. It can also make sense without any confusion prefix: subtract the vector instead, with the goal of making the model output the incorrect answer.

This last variant might be the simplest such experiment to run, and my current recommendation would be to start from it. In addition to tracking the probability of the correct answer token minus that of the incorrect answer token, one probably also wants to check that the sum of these two probabilities is not decreased too much, to verify that the edit did not lobotomize the model. (For an autoregressive model, one could also look at the perplexity of the next sentence according to an unedited model.) A sketch of this bookkeeping follows below.
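
A sketch of the suggested bookkeeping (token ids as in the first sketch):

```python
# Track p(correct) - p(incorrect) as the main effect, and p(correct) + p(incorrect)
# as a sanity check that the edit did not lobotomize the model.
import torch.nn.functional as F

def answer_metrics(logits, correct_id, incorrect_id):
    probs = F.softmax(logits[0, -1, :], dim=-1)
    p_correct = probs[correct_id].item()
    p_incorrect = probs[incorrect_id].item()
    return {
        "margin": p_correct - p_incorrect,  # should move as predicted after the edit
        "mass": p_correct + p_incorrect,    # should stay close to the unedited run
    }
```

Comparing `answer_metrics` between an edited and an unedited forward pass gives both the effect size and the lobotomy check in one place.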

@kaarelh

kaarelh commented Apr 27, 2023

More concrete experiment 2

Another simple setup is to have prompts of the form

"Is 2+2=4? Answer: True

The previous question was answered ______"

We would then look at the probabilities the model assigns to "correctly"/"incorrectly" as the completion, or score the completions some other way if these are not single tokens.

Then edit along the truth direction at the "True" token in one of the ways specified two comments up, and see whether this changes the zero-shot behavior as we'd expect. (A sketch of the measurement follows below.)
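
A sketch of the measurement, again reusing the placeholder `model`/`tokenizer` from above; multi-token completions are scored by summed log-probability, which sidesteps the single-token formatting issue:

```python
# Score "correctly" vs. "incorrectly" as completions by summed log-probability.
import torch

prompt = "Is 2+2=4? Answer: True\n\nThe previous question was answered"
# Token position of "True", assuming the prefix tokenization lines up with the full
# prompt's tokenization; an edit hook (as in the first sketch) would target this index.
edit_position = len(tokenizer("Is 2+2=4? Answer: True").input_ids) - 1

def completion_logprob(prompt, completion):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    comp_ids = tokenizer(completion, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, comp_ids], dim=1)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    # Logits at position j predict token j + 1.
    return sum(
        logprobs[0, prompt_ids.shape[1] + i - 1, tok].item()
        for i, tok in enumerate(comp_ids[0])
    )

for completion in (" correctly", " incorrectly"):
    print(completion, completion_logprob(prompt, completion))
```

Running this with and without the edit hook registered shows whether the edit at "True" shifts probability mass from " correctly" toward " incorrectly".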
