Evaluate alternating text #5

lauritowal · 2023-04-11T20:39:42Z

We could feed the model a passage of alternating true/false sentences and check which inner states (i.e. which token positions) are best for determining the truth of each particular sentence. Are these always the positions of the tokens in that sentence? Does the signal become more spread out as one goes deeper into the transformer? The hypothesis is that if we can locate the positions the model looks at for each true sentence, we can trace them back to the model's internal representation of truth.
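A minimal sketch of the position sweep described above, under loud assumptions: the hidden states are synthetic (random noise plus a planted "truth direction" inside each sentence's own token span) rather than real model activations, the probe is a simple mean-difference linear probe swapped in for DLK, and truth labels are drawn independently per sentence instead of strictly alternating so that positions can be attributed to individual sentences. All function and parameter names are hypothetical.

```python
import numpy as np


def make_synthetic_states(n_passages=400, n_sentences=4, tokens_per_sentence=5,
                          d_model=16, signal=2.0, seed=0):
    """Stand-in for one layer's hidden states: (passage, position, feature)."""
    rng = np.random.default_rng(seed)
    n_pos = n_sentences * tokens_per_sentence
    # Assumption: one independent true(1)/false(0) label per sentence.
    labels = rng.integers(0, 2, size=(n_passages, n_sentences))
    truth_dir = rng.normal(size=d_model)
    truth_dir /= np.linalg.norm(truth_dir)
    states = rng.normal(size=(n_passages, n_pos, d_model))
    for s in range(n_sentences):
        span = slice(s * tokens_per_sentence, (s + 1) * tokens_per_sentence)
        sign = 2 * labels[:, s] - 1  # +1 for true, -1 for false
        # Plant the truth signal only in the sentence's own token span.
        states[:, span, :] += signal * sign[:, None, None] * truth_dir
    return states, labels


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Mean-difference linear probe, scored on held-out passages."""
    mu_true = train_x[train_y == 1].mean(axis=0)
    mu_false = train_x[train_y == 0].mean(axis=0)
    direction = mu_true - mu_false
    threshold = direction @ (mu_true + mu_false) / 2
    pred = (test_x @ direction > threshold).astype(int)
    return (pred == test_y).mean()


def accuracy_map(states, labels, train_frac=0.5):
    """acc[s, p]: how well position p predicts the truth of sentence s."""
    n_passages, n_pos, _ = states.shape
    n_sentences = labels.shape[1]
    n_train = int(n_passages * train_frac)
    acc = np.zeros((n_sentences, n_pos))
    for s in range(n_sentences):
        for p in range(n_pos):
            acc[s, p] = probe_accuracy(
                states[:n_train, p], labels[:n_train, s],
                states[n_train:, p], labels[n_train:, s],
            )
    return acc


states, labels = make_synthetic_states()
acc = accuracy_map(states, labels)
best_positions = acc.argmax(axis=1)  # most informative position per sentence
print(np.round(acc, 2))
print("best position per sentence:", best_positions)
```

With real activations, the same map could be computed once per layer; comparing the maps across layers is one way to look at whether the informative positions spread beyond each sentence's own tokens with depth.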

DLK is a non-mechanistic interpretability technique since it only finds a representation of truth; it doesn’t provide a mechanism. On the other hand, if the above works, it might provide information on how the model stores truth, which is useful for mechanistic interpretability research.
