One worry with DLK is that it might end up recovering the beliefs of some simulated agent (e.g. a simulated aligned AGI, a particular human, or humanity's scientific consensus) rather than the model's own beliefs. One idea for checking this is to study how the LM represents the beliefs of characters it is simulating, by doing supervised learning with particular prompting and labels. For instance, prompt the model to produce text as North Korean state news, and use supervised probing to find a representation in the model of truth-according-to-North-Korean-state-news (a minimal sketch of such a probe is below).
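To make the supervised-probing step concrete, here is a minimal sketch, not a definitive setup: it assumes a HuggingFace causal LM (the model name, the persona prompt, the choice of last-token hidden state, and the toy labelled statements are all illustrative placeholders), and it fits a simple logistic-regression probe whose weight vector serves as a candidate "truth according to the simulated agent" direction.

```python
# Sketch: supervised probe for truth-according-to-a-simulated-agent.
# All concrete choices below (model, persona prompt, layer, data) are assumptions
# for illustration, not part of the original proposal.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; any causal LM that exposes hidden states would do
PERSONA = "The following is a report from North Korean state news.\n"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_state(statement: str, layer: int = -1) -> torch.Tensor:
    """Last-token hidden state of `statement`, framed by the persona prompt."""
    inputs = tokenizer(PERSONA + statement, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy labelled data: label = 1 if the statement is true *according to the simulated
# agent*, which need not match ground truth. A real experiment needs far more data.
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Berlin.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 50 degrees Celsius at sea level.", 0),
]

X = torch.stack([hidden_state(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)

# probe.coef_ is a candidate direction for truth-according-to-the-persona,
# to be compared later against whatever direction unsupervised DLK recovers.
print("persona-truth probe direction shape:", probe.coef_.shape)
```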
If we do this for a bunch of simulated agents, maybe we can draw some general conclusion about how truth-according-to-a-simulated-agent is represented. If that contrasts with what DLK finds, it would be evidence that DLK is not just finding the representation of truth according to some simulated agent. Or maybe we can even use this to develop a good understanding of how language models represent concepts of simulated agents vs. the analogous concepts of their own. That would let us figure out e.g. the goals of a language model: first understand how the goals of simulated agents are represented using supervised probing, then apply the general mapping from representations of simulatee-concepts to representations of the model's own analogous concepts which we developed for truth (if we're lucky and it generalizes).
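As a hypothetical first-pass comparison (assuming we already have both the supervised persona-probe direction and a DLK-recovered direction as vectors in the same hidden layer), one could simply measure how aligned the two directions are; consistently low similarity across many simulated agents would be weak evidence that DLK is not just recovering a simulatee's notion of truth, though it is far from conclusive.

```python
# Sketch: compare a supervised persona-truth probe direction with a DLK direction.
# Assumes both directions are 1-D vectors in the same hidden-state space.
import numpy as np

def direction_similarity(persona_dir: np.ndarray, dlk_dir: np.ndarray) -> float:
    """Cosine similarity between the two probe directions."""
    persona_dir = persona_dir / np.linalg.norm(persona_dir)
    dlk_dir = dlk_dir / np.linalg.norm(dlk_dir)
    return float(persona_dir @ dlk_dir)
```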
See Point 13 in Additional Ideas: https://www.lesswrong.com/posts/bFwigCDMC5ishLz7X/rfc-possible-ways-to-expand-on-discovering-latent-knowledge#Additional_ideas_that_came_up_while_writing_this_post_