Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persona / Character #1

Open
lauritowal opened this issue Apr 11, 2023 · 0 comments
Open

Persona / Character #1

lauritowal opened this issue Apr 11, 2023 · 0 comments

Comments

@lauritowal
Copy link

One worry with DLK is that it might end up recovering the beliefs of some simulated agent (e.g. a simulated aligned AGI, or a particular human, or humanity’s scientific consensus). One idea for checking for this is to study how the LM represents the beliefs of characters it is modeling by just doing supervised learning with particular prompting and labels. For instance, get it to produce text as North Korean state news, and do supervised learning to find a representation in the model of truth-according-to-North-Korean-state-news.

If we do this for a bunch of simulated agents, maybe we can state some general conclusion about how the truth according to some simulated agent is represented, and maybe this contrasts with what is found by DLK providing evidence that DLK is not just finding the representation of truth according to some simulated agent. Or maybe we can even use this to develop a nice understanding of how language models represent concepts of simulated agents vs analogous concepts of their own, which would let us figure out e.g. the goals of a language model from understanding how goals of simulated agents are represented using supervised probing, and then using the general mapping from reps of simulatee-concepts to reps of the model’s own analogous concepts which we developed for truth (if we’re lucky and it generalizes).

See Point 13 in Additional Ideas: https://www.lesswrong.com/posts/bFwigCDMC5ishLz7X/rfc-possible-ways-to-expand-on-discovering-latent-knowledge#Additional_ideas_that_came_up_while_writing_this_post_

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant