
# Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

This repository hosts the data and code of the paper [Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering](https://arxiv.org/abs/2410.15999).


## Project Setup

```bash
conda create -n spare python=3.9 -y
conda activate spare
bash ./scripts/install.sh
```

## Run SpARE

```bash
python ./demo.py
```

Test your own cases by replacing `test_examples` in `demo.py`; a sketch of one possible example format follows.
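A minimal sketch of what an entry might look like, assuming each example pairs a question with a counterfactual context. The field names here are illustrative guesses, not the repository's actual schema; check `demo.py` for the real format:

```python
# Hypothetical example format -- see demo.py for the actual fields.
test_examples = [
    {
        # Question the model can answer from its parametric (memorised) knowledge.
        "question": "Who wrote 'Romeo and Juliet'?",
        # Counterfactual context that conflicts with that knowledge.
        "context": "'Romeo and Juliet' was written by Charles Dickens.",
        "parametric_answer": "William Shakespeare",
        "context_answer": "Charles Dickens",
    },
]
```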

SpARE currently supports only the short-form ODQA task; we plan to add support for more tasks in the next version.

## Run Experiments

Use the cached intermediate data to run the experiments.

The cached data is in the `cache_data` folder and includes the mutual information, the expectations, and the values of the functional SAE activations; a sketch for inspecting these files follows the command below.

```bash
bash ./scripts/run_all_experiments.sh
```
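To inspect the cache before running, here is a minimal sketch assuming the statistics are stored as PyTorch tensors; the file names below are guesses, not the repository's actual layout:

```python
import torch

# Hypothetical file names -- list the cache_data folder for the real layout.
mutual_information = torch.load("cache_data/mutual_information.pt")
expectations = torch.load("cache_data/expectations.pt")

# One score / expected value per SAE activation.
print(mutual_information.shape, expectations.shape)
```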

## Run SpARE Step by Step

Observe the outputs of the prompts and group them based on their knowledge selection behaviours (a sketch of the grouping logic follows):

```bash
bash ./scripts/run_group_prompts.sh
```
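Conceptually, each prompt is labelled by which knowledge source the model's answer agrees with. A minimal sketch of that logic, where the helper `generate_answer` and the example fields are our assumptions, not the repository's API:

```python
def group_by_behaviour(examples, generate_answer):
    """Split prompts by whether the model follows the context or its parametric memory."""
    use_context, use_parameter = [], []
    for ex in examples:
        answer = generate_answer(ex["question"], ex["context"])  # model's greedy output
        if ex["context_answer"].lower() in answer.lower():
            use_context.append(ex)      # followed the in-context (counterfactual) answer
        elif ex["parametric_answer"].lower() in answer.lower():
            use_parameter.append(ex)    # fell back on parametric knowledge
        # prompts matching neither answer belong to no group and are dropped
    return use_context, use_parameter
```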

Save the activations of the grouped prompts:

```bash
bash ./scripts/run_save_grouped_activations.sh
```
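Under the hood this amounts to caching hidden states at the chosen layers and encoding them with the pre-trained SAE. A minimal sketch using a forward hook, where the Llama module path and the `sae.encode` interface are assumptions about the actual implementation:

```python
import torch

def save_grouped_activations(model, tokenizer, sae, prompts, layer_id):
    """Collect SAE activations at the last prompt token for one group of prompts."""
    cached = []

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cached.append(hidden[:, -1, :].detach())  # residual stream at the last token

    handle = model.model.layers[layer_id].register_forward_hook(hook)
    with torch.no_grad():
        for prompt in prompts:
            input_ids = tokenizer(prompt, return_tensors="pt").input_ids
            model(input_ids)
    handle.remove()

    hiddens = torch.cat(cached, dim=0)   # [n_prompts, d_model]
    return sae.encode(hiddens)           # sparse activations, one row per prompt
```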

Estimate the mutual information and the expectations for each SAE activation:

```bash
bash ./scripts/run_mutual_information_and_expectations.sh
```
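For intuition, the mutual information between an SAE activation (binarised as fired / not fired) and the behaviour group can be estimated from firing frequencies in the two groups. A minimal plug-in estimator as a sketch, not necessarily the paper's exact estimator:

```python
import torch

def mutual_information(acts_a, acts_b, eps=1e-8):
    """Plug-in MI estimate between 'activation fired' and the behaviour group.

    acts_a, acts_b: [n_prompts, n_sae_features] activations from the two groups.
    Returns one MI score (in nats) per SAE feature.
    """
    fired_a = (acts_a > 0).float().mean(dim=0)  # P(fire | group A)
    fired_b = (acts_b > 0).float().mean(dim=0)  # P(fire | group B)
    p_fire = 0.5 * (fired_a + fired_b)          # P(fire), groups weighted equally

    def group_term(p_cond, p_marg):
        # 0.5 * sum over {fire, not fire} of P(z|g) * log(P(z|g) / P(z))
        return 0.5 * (
            p_cond * torch.log((p_cond + eps) / (p_marg + eps))
            + (1 - p_cond) * torch.log((1 - p_cond + eps) / (1 - p_marg + eps))
        )

    return group_term(fired_a, p_fire) + group_term(fired_b, p_fire)
```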

## Evaluate SpARE

```bash
python ./scripts/run_spare.py \
  --model_path="meta-llama/Llama-2-7b-hf" \
  --data_name="nqswap" \
  --layer_ids 12 13 14 15 \
  --edit_degree=2.0 \
  --select_topk_proportion=0.07 \
  --seed=42 \
  --hiddens_name="grouped_activations" \
  --mutual_information_save_name="mutual_information" \
  --run_use_parameter \
  --run_use_context
```
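`--run_use_parameter` and `--run_use_context` presumably evaluate steering towards the parametric and the contextual behaviour, respectively. At inference time, the editing step roughly pushes the selected SAE activations towards the target behaviour's expected values, scaled by `--edit_degree`. A minimal sketch of that step, where the `decode` interface and all names are our assumptions, not the repository's implementation:

```python
import torch

def steer_hidden_state(hidden, sae, mu_target, mu_opposite, selected_ids, edit_degree=2.0):
    """Sketch of one steering step on a residual-stream vector `hidden`.

    mu_target, mu_opposite: expected SAE activations under the two behaviours.
    selected_ids: functional activations chosen via mutual information
                  (e.g. the top `select_topk_proportion` fraction).
    """
    edit = torch.zeros_like(mu_target)
    # push towards the target behaviour and away from the opposite one
    edit[selected_ids] = edit_degree * (mu_target[selected_ids] - mu_opposite[selected_ids])
    # decoding maps the sparse edit back into the residual stream
    return hidden + sae.decode(edit)
```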

## Acknowledgement

The implementation of the sparse auto-encoder is adapted from [EleutherAI/sae](https://github.com/EleutherAI/sae) and [jbloomAus/SAELens](https://github.com/jbloomAus/SAELens). We appreciate their open-source contributions!

## Citing

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering:

```bibtex
@misc{zhao2024steeringknowledgeselectionbehaviours,
      title={Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering},
      author={Yu Zhao and Alessio Devoto and Giwon Hong and Xiaotang Du and Aryo Pradipta Gema and Hongru Wang and Xuanli He and Kam-Fai Wong and Pasquale Minervini},
      year={2024},
      eprint={2410.15999},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.15999}
}
```

The preliminary study, Analysing the Residual Stream of Language Models Under Knowledge Conflicts:

```bibtex
@misc{zhao2024analysingresidualstreamlanguage,
      title={Analysing the Residual Stream of Language Models Under Knowledge Conflicts},
      author={Yu Zhao and Xiaotang Du and Giwon Hong and Aryo Pradipta Gema and Alessio Devoto and Hongru Wang and Xuanli He and Kam-Fai Wong and Pasquale Minervini},
      year={2024},
      eprint={2410.16090},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.16090}
}
```