Automate direct logit attribution process #172

valedan · 2023-04-12T20:06:59Z

@mivanit @cmathw We discussed this in the research meeting yesterday and I think one of you wrote down some details - can you fill in here?

mivanit · 2023-04-13T00:41:34Z

short version:

@cmathw has put together a notebook in the experiments repo which performs direct logit attribution on the "produce the first path token" task. We'd like to extend this to more tasks, and the best interface would probably be along the lines of

def direct_logit_attribution(
    model,
    task_data: list[tuple[prompt, response_token]],
) -> LogitAttributionResponse:
    ...

extending the SolvedMaze class might also be an option, but it's probably easiest to just pass lists of tuples of whatever tokens the task consists of.
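For concreteness, here is a rough, runnable sketch of how that interface could be filled in. It is not the notebook's code. The `AttributableModel` protocol, its two methods, and the fields of `LogitAttributionResponse` are all made up for illustration, and the sketch ignores the final layer norm that a real transformer would apply before unembedding. The idea is only that each component's write to the residual stream at the last position is projected onto the unembedding direction of the correct token, then averaged over the task examples.

```python
from dataclasses import dataclass
from typing import Protocol

# (prompt tokens, correct next token): the "list of tuples" from the proposal
TaskExample = tuple[list[str], str]


class AttributableModel(Protocol):
    """Hypothetical interface, not an API of any real library."""

    def component_outputs(self, prompt: list[str]) -> dict[str, list[float]]:
        """Each component's write to the residual stream at the last position."""
        ...

    def unembed_direction(self, token: str) -> list[float]:
        """Unembedding vector for `token` (same width as the residual stream)."""
        ...


@dataclass
class LogitAttributionResponse:
    # mean contribution of each component to the correct token's logit
    mean_attribution: dict[str, float]
    n_examples: int


def direct_logit_attribution(
    model: AttributableModel,
    task_data: list[TaskExample],
) -> LogitAttributionResponse:
    totals: dict[str, float] = {}
    for prompt, response_token in task_data:
        direction = model.unembed_direction(response_token)
        for name, vec in model.component_outputs(prompt).items():
            # dot product = this component's direct effect on the correct logit
            contribution = sum(v * d for v, d in zip(vec, direction))
            totals[name] = totals.get(name, 0.0) + contribution
    n = len(task_data)
    mean = {name: total / n for name, total in totals.items()} if n else {}
    return LogitAttributionResponse(mean_attribution=mean, n_examples=n)


class ToyModel:
    """Stand-in with fixed outputs, only to show the call shape."""

    def component_outputs(self, prompt):
        return {"block_0": [1.0, 0.0], "block_1": [0.0, 2.0]}

    def unembed_direction(self, token):
        return {"a": [1.0, 0.0], "b": [0.0, 1.0]}[token]


result = direct_logit_attribution(ToyModel(), [(["x"], "a"), (["y"], "b")])
print(result.mean_attribution)
```

Passing plain tuples keeps the function independent of any maze-specific class, which is the tradeoff mentioned above versus extending SolvedMaze.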

currently, logit attribution in the notebook measures the importance of various blocks on the task of correctly predicting the first token after the path_start token, which should just be copying from the specification of the path start
another option is to try the same on the task of producing the path_end token when the current token matches the target node
unclear what other sorts of tasks make sense -- hallway following? picking correct fork?
it probably makes sense to write evals that pair with each task we want to run logit attribution on. is it reasonable to reuse code between these two areas?
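As an illustration of the two tasks described above, this is one way the (prompt, response_token) pairs might be cut out of a tokenized solved maze. The marker strings `<PATH_START>` and `<PATH_END>`, the coordinate token format, and both function names are assumptions made for this sketch and may not match the real tokenizer. The same pairs could in principle feed both an accuracy eval and logit attribution, which is the code reuse asked about.

```python
# Marker strings are assumptions for this sketch; substitute the real tokens.
PATH_START = "<PATH_START>"
PATH_END = "<PATH_END>"


def first_path_token_example(tokens: list[str]) -> tuple[list[str], str]:
    """Prompt ends at the path_start token; response is the token after it."""
    idx = tokens.index(PATH_START)
    if idx + 1 >= len(tokens):
        raise ValueError("no token follows the path_start marker")
    return tokens[: idx + 1], tokens[idx + 1]


def path_end_example(tokens: list[str], target: str) -> tuple[list[str], str]:
    """Prompt ends where the path reaches the target; response is path_end."""
    start = tokens.index(PATH_START)
    # search only after path_start, since the target also appears in the spec
    idx = tokens.index(target, start + 1)
    return tokens[: idx + 1], PATH_END


# Toy token sequence in a made-up format, only to show the slicing.
tokens = [
    "<ORIGIN>", "(0,0)", "<TARGET>", "(1,1)",
    PATH_START, "(0,0)", "(0,1)", "(1,1)", PATH_END,
]

first = first_path_token_example(tokens)
end = path_end_example(tokens, target="(1,1)")
print(first)
print(end)
```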

rusheb · 2023-04-13T08:19:55Z

Is this high priority? If so, I would be interested in working on it!

mivanit added the "research" (Research and Experimentation) label · Sep 4, 2023
