You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We could try to input a passage of alternating true/false sentences and try to see which inner states (i.e. which position) are best for determining the truth of each particular sentence. Are these always the positions of the tokens in that sentence? Does it get more spread out as one goes deeper into the transformer? The hypothesis is that if we can locate the positions that the model looks for in each true sentence, we can trace that to the model's internal representation of the truth.
DLK is a non-mechanistic interpretability technique since it only finds a representation of truth; it doesn’t provide a mechanism. On the other hand, if the above works, it might provide information on how the model stores truth, which is useful for mechanistic interpretability research.
We could try to input a passage of alternating true/false sentences and try to see which inner states (i.e. which position) are best for determining the truth of each particular sentence. Are these always the positions of the tokens in that sentence? Does it get more spread out as one goes deeper into the transformer? The hypothesis is that if we can locate the positions that the model looks for in each true sentence, we can trace that to the model's internal representation of the truth.
DLK is a non-mechanistic interpretability technique since it only finds a representation of truth; it doesn’t provide a mechanism. On the other hand, if the above works, it might provide information on how the model stores truth, which is useful for mechanistic interpretability research.
See in post
The text was updated successfully, but these errors were encountered: