@cmathw has put together a notebook in the experiments repo which performs direct logit attribution on the "produce the first path token" task. We'd like to extend this to more tasks, and the best interface would probably be along the lines of
extending the SolvedMaze class might also be an option, but it's probably easiest to just pass lists of tuples of whatever tokens the task consists of.
currently, logit attribution in the notebook measures the importance of various blocks on the task of correctly predicting the first token after the path_start token, which should just be copying from the specification of the path start
another option is to try the same on the task of producing the path_end token, if the current token matches the target node
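as a rough sketch of what "lists of tuples of tokens" could look like for the two tasks above: each task becomes a (prompt tokens, expected next token) pair built from a tokenized solved maze. everything here is an assumption for illustration, not the notebook's actual code: the `<PATH_START>` / `<PATH_END>` token strings, the `Task` alias, and both helper names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical special tokens; the real vocabulary may use different strings.
PATH_START = "<PATH_START>"
PATH_END = "<PATH_END>"

# A task is (prompt tokens, expected next token).
Task = Tuple[List[str], str]


def first_path_token_task(tokens: List[str]) -> Task:
    """Prompt ends at PATH_START; the target is the token right after it."""
    idx = tokens.index(PATH_START)
    return tokens[: idx + 1], tokens[idx + 1]


def path_end_task(tokens: List[str]) -> Task:
    """Prompt ends at the last path node; the target is PATH_END."""
    idx = tokens.index(PATH_END)
    return tokens[:idx], PATH_END


# Toy tokenized maze (adjacency list omitted for brevity).
example = [
    "<ORIGIN_START>", "(0,0)", "<ORIGIN_END>",
    "<TARGET_START>", "(0,1)", "<TARGET_END>",
    PATH_START, "(0,0)", "(0,1)", PATH_END,
]

# One list of (prompt, target) pairs per task.
tasks = {
    "first_path_token": [first_path_token_task(example)],
    "path_end": [path_end_task(example)],
}
```

with pairs like these, the attribution code would only need a prompt and a target token per example, so new tasks (hallway following, fork choice) would just be new functions that produce the same tuple shape.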
unclear what other sorts of tasks make sense -- hallway following? picking correct fork?
probably makes sense to make evals which pair with any task we want to do logit attribution on. is it reasonable to reuse code between these two areas?
@mivanit @cmathw We discussed this in the research meeting yesterday and I think one of you wrote down some details - can you fill in here?