Read the blog post | Read the paper
What does it mean for large language models (LLMs) to be trained on massive text corpora, collectively produced by millions of distinct human authors? When prompted, an LLM generates text that reflects the characteristics of the kind of agent likely to have produced the conditioning context. This suggests that, with appropriate conditioning, LLMs could be guided to approximate the responses of a particular human voice, rather than the mixture of voices that otherwise emerges.
This raises an interesting question: How can we condition the model to reflect a distinct, individual voice with precision, amidst the vast diversity of perspectives within the text corpora?
To answer this question, we introduce Anthology, a method for conditioning LLMs to representative, consistent, and diverse virtual personas by generating and utilizing naturalistic backstories with rich details of individual values and experience. Additionally, we present methods for generating these backstories directly from LLMs, enabling efficient production of extensive sets that span a broad spectrum of human demographics. By grounding language models in detailed backstories, Anthology enhances their ability to simulate individual human profiles with greater accuracy, as measured by alignment with the distribution and consistency of authentic human responses.
The configuration of all experiments is handled by Hydra. The configuration files can be found in the configs directory.
To install the required packages, run the following commands:
git clone [email protected]:CannyLab/anthology.git
cd anthology
pip install -r requirements.txt
pip install -e .
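To sanity-check the editable install, you can try importing the package (assuming it is importable as anthology, matching the repository name):

python -c "import anthology"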
We share approximately 10,000 pre-generated backstories at this link.
To generate backstories, you can use the following command:
python scripts/s0_self_generate_backstories.py num_backstories=1000 num_workers=10
The num_backstories parameter specifies the number of backstories to generate. The num_workers parameter specifies the number of workers to use for parallel processing. The default parameter config YAML file can be found in configs/generate_backstory/config.yaml. To change the parameters, you can refer to this link: Generate Backstories Config
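Overrides such as num_backstories=1000 are standard Hydra command-line overrides. As a minimal sketch of how they reach the script, assuming a conventional Hydra entry point (illustrative only, not the repository's actual code):

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../configs/generate_backstory", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # CLI overrides like num_backstories=1000 are already merged into cfg here
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()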
You might need to load the backstories to conduct a downstream survey. For example, you might want to survey the demographic information of the backstories. You can use the following command:
python scripts/s1_ask_demographics.py num_workers=10 backstories.backstory_path="path_to_backstories"
The backstories.backstory_path parameter specifies the path to the backstories; it should be a directory containing the backstories.
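Since the path is a directory, a loader can be as simple as the sketch below. The on-disk format is an assumption here (one plain-text backstory per .txt file); the shared backstories may be stored differently:

from pathlib import Path

def load_backstories(backstory_path: str) -> list[str]:
    # Assumed format: one backstory per .txt file in the directory
    return [f.read_text() for f in sorted(Path(backstory_path).glob("*.txt"))]

backstories = load_backstories("path_to_backstories")
print(f"Loaded {len(backstories)} backstories")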
To define a new task and its configuration, create new configuration files in the configs directory. The configuration files should be named surveys/{task_name}_config.yaml and should contain the following parameters:
- defaults: The default parameters for the task
- backstories: The configuration for the backstories
- llm_parameters: The configuration for the language model to use
For example, if you want to define a new task called survey_demographics, you can create a new configuration file called surveys/demo_config.yaml. The configuration file should contain the following parameters:
defaults:
  - backstories: self_generated
  - experiments/demographics:
      - age
      - education_level
      - gender
      - race
      - income_level
  - special_prompt: consistency_prompt
  - format/mcq_symbol: uppercase
  - format/choice_format: curved_bracket
  - format/surveyor_respondent: question_answer
  - llm_parameters: openai
  - _self_

# Set the output directory
hydra:
  run:
    dir: outputs/demo_survey/${now:%Y-%m-%d}/${now:%H-%M-%S}

run_dir: ${hydra:runtime.output_dir}

# output data name
output_data_name: ${now:%Y-%m-%d}_${now:%H-%M-%S}

# Debugging mode
debug: false

# Parallelization parameter
num_processes: 1

# The seed to use for the random number generator
seed: 42

# Frequency of saving the generated backstories
freq: 5
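Assuming the survey scripts share the same Hydra setup as the rest of the repository, a config like this can be selected and individual groups overridden from the command line. A hypothetical invocation (the script name and override values below are illustrative):

python scripts/s1_ask_demographics.py --config-name surveys/demo_config \
format/mcq_symbol=lowercase \
special_prompt=no_prompt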
The parameters are explained as follows:
- defaults: The default parameters for the task
- backstories: The YAML configuration file of the backstories (link). The configuration file should contain the following parameters:
  - backstory_path: The path to the backstories
  - backstory_type: The type of backstories. The type can be self_generated or ANES (American National Election Studies)
  - num_backstories: The number of backstories to use. If it is set to -1, all backstories will be used
  - shuffle: Whether to shuffle the backstories. If it is set to true, the backstories will be shuffled (a selection sketch follows this list)
- experiments/demographics: The demographic information to survey. The demographic information can be age, education_level, gender, race, and income_level. The type of question (e.g., multiple choice, open-ended) can be specified in the configuration file. The exact wording and the available choices of each question are specified in the configuration file.
- llm_parameters: The YAML configuration file of the language model to use
- special_prompt: The special prompt to use for the task (e.g., the consistency prompt). Possible values are consistency_prompt, cot_prompt, and no_prompt
- format/mcq_symbol: The symbol to use for the multiple-choice questions. Possible values are uppercase, lowercase, and number
- format/choice_format: The format to use for the multiple-choice questions. Possible values are curved_bracket, bracket, dot, and space
- format/surveyor_respondent: The phrase to use for turn-taking between the surveyor and the respondent. Possible values are question_answer, surveyor_respondent, and interviewer_interviewee
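As a rough illustration of how the num_backstories, shuffle, and seed options could interact, under assumed semantics (a sketch, not the repository's implementation):

import random

def select_backstories(backstories: list[str], num_backstories: int = -1,
                       shuffle: bool = True, seed: int = 42) -> list[str]:
    pool = list(backstories)
    if shuffle:
        random.Random(seed).shuffle(pool)  # deterministic shuffle via the config seed
    # num_backstories == -1 means "use all backstories"
    return pool if num_backstories == -1 else pool[:num_backstories]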
Chat models for demographic survey
python scripts/run_demographic_survey_chat.py \
backstories.backstory_path="path_to_backstories" \
use_prefer_not_to_answer=False \
format/surveyor_respondent=multiturn_chat \
save_dir="path_to_saving_directory" \
llm_parsing_parameters.top_logprobs=20 \
llm_parsing_parameters.logprobs=True \
num_processes=8
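The llm_parsing_parameters.logprobs and top_logprobs settings request per-token log-probabilities so that a distribution over the answer choices can be read off the model's first answer token. A hedged sketch of that idea, with assumed names and shapes (not the repository's parsing code):

import math

def choice_distribution(top_logprobs: dict[str, float],
                        choices: tuple[str, ...] = ("A", "B", "C", "D")) -> dict[str, float]:
    # top_logprobs: token -> log-probability at the answer position,
    # as returned by an OpenAI-style API when logprobs are requested
    probs = {c: math.exp(top_logprobs[c]) for c in choices if c in top_logprobs}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}  # renormalize over the listed choices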
Generate backstory with demographic information
python scripts/run_backstory_gen_with_demographic_information.py \
wave=111 \
prompt_style=portray \
llm_parameters=mixtral2_chat_together \
save_dir="path_to_saving_directory" \
num_processes=8
Demographic matching
python scripts/run_demographic_matching.py \
--demographic_survey_method chatmodel_multiturn \
--demographic_survey_file batch_1+2 \
--human_survey ATP_W34 \
--matching_method greedy
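The --matching_method flag selects how virtual personas are paired with human respondents. A sketch of the two strategies named in this README, greedy and maximum-weight matching, over an assumed persona-by-respondent similarity matrix (illustrative, not the repository's code):

import numpy as np
from scipy.optimize import linear_sum_assignment

def maximum_weight_matching(similarity: np.ndarray) -> list[tuple[int, int]]:
    # Hungarian algorithm: maximizes total similarity between
    # personas (rows) and human respondents (columns)
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return list(zip(rows, cols))

def greedy_matching(similarity: np.ndarray) -> list[tuple[int, int]]:
    sim = similarity.astype(float).copy()
    pairs = []
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)  # best remaining pair
        pairs.append((int(i), int(j)))
        sim[i, :] = -np.inf  # persona i is taken
        sim[:, j] = -np.inf  # respondent j is taken
    return pairs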
Run ATP survey
python scripts/run_atp_survey.py \
wave=34 \
questionnaire/ATP=ATP_W34_questions \
backstories.backstory_path="path_to_backstories" \
llm_parameters=mixtral2_together \
num_processes=8 \
save_dir="path_to_saving_directory"
Template of a full survey run (2 baseline methods, 3 demographics-conditioned backstories, 3 natural backstories with demographic traits appended using greedy matching, and 3 natural backstories with demographic traits appended using maximum-weight matching; 11 commands in total)
# baseline - QA
python scripts/run_atp_baseline_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
prompt_style=qa \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# baseline - BIO
python scripts/run_atp_baseline_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
prompt_style=bio \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - demographics-primed backstories generated by QA MCQ prompt
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=DP_QA \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - demographics-primed backstories generated by PORTRAY prompt
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=DP_PORTRAY \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - demographics-primed backstories generated by BIO prompt
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=DP_BIO \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with greedy matching, demographic traits appended with QA MCQ method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_greedy_QA_MCQ \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with greedy matching, demographic traits appended with QA FREEFORM method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_greedy_QA_FREEFORM \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with greedy matching, demographic traits appended with BIO method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_greedy_BIO \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with maximum weight matching, demographic traits appended with QA MCQ method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_maximum_weight_QA_MCQ \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with maximum weight matching, demographic traits appended with QA FREEFORM method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_maximum_weight_QA_FREEFORM \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
# anthology - natural backstories with maximum weight matching, demographic traits appended with BIO method
python scripts/run_atp_survey.py \
wave=111 \
questionnaire/ATP=ATP_W111_questions_second_new_set \
llm_parameters=llama3_70b_vllm \
num_processes=128 \
output_data_name=NA_maximum_weight_BIO \
backstories.backstory_path="path_to_backstories" \
save_dir="path_to_saving_directory"
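Run ATP analysis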
python scripts/run_atp_analysis.py \
wave=111 \
output_data_name="output_data_name" \
experiment_repository="path_to_experiment_repository" \
dist_type="EMD" # or "TV"
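dist_type selects the distance between the simulated and human answer distributions: earth mover's distance (EMD) or total variation (TV). For reference, here is how both metrics are computed on a toy pair of distributions (illustrative values only, not results from the paper):

import numpy as np
from scipy.stats import wasserstein_distance

model_dist = np.array([0.10, 0.20, 0.40, 0.30])   # toy LLM answer distribution
human_dist = np.array([0.15, 0.25, 0.35, 0.25])   # toy human answer distribution
positions = np.arange(len(model_dist))            # ordinal answer-choice positions

emd = wasserstein_distance(positions, positions, model_dist, human_dist)
tv = 0.5 * np.abs(model_dist - human_dist).sum()  # total variation distance
print(f"EMD = {emd:.4f}, TV = {tv:.4f}")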
@article{moon2024virtual,
title={Virtual personas for language models via an anthology of backstories},
author={Moon, Suhong and Abdulhai, Marwa and Kang, Minwoo and Suh, Joseph and Soedarmadji, Widyadewi and Behar, Eran Kohen and Chan, David M},
journal={arXiv preprint arXiv:2407.06576},
year={2024}
}