Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering


This repository contains the data and code for the paper Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering.


Project Setup

conda create -n spare python=3.9 -y
conda activate spare
bash ./scripts/install.sh

Run SpARE

python ./demo.py

Test your own cases by replacing test_examples in demo.py.

SpARE currently only supports the short-form ODQA task; support for more tasks is planned for the next version.
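
The exact input schema is defined in demo.py; as a rough illustration only (the field names below are assumptions, not the repository's actual keys), a short-form ODQA example with a knowledge conflict might look like:

# Hypothetical shape for test_examples; check demo.py for the real schema.
test_examples = [
    {
        # Question whose in-context answer conflicts with the model's
        # parametric (memorised) answer.
        "question": "Who wrote Romeo and Juliet?",
        # Context supporting a counterfactual answer.
        "context": "Romeo and Juliet is a tragedy written by Charles Dickens.",
    },
]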

Run Experiments

Use the cached intermediate data to run the experiments.

The cached data is in the cache_data folder and includes the mutual information estimates, the expectations, and the values of the functional SAE activations.

bash ./scripts/run_all_experiments.sh

Run SpARE Step by Step

Observe the model's outputs on the prompts and group the prompts by their knowledge-selection behaviour:

bash ./scripts/run_group_prompts.sh
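
Conceptually, a prompt is grouped by checking which knowledge source the generated answer agrees with. A minimal sketch of that idea (not the script's actual code; the loose substring matching below is an assumption):

def group_prompt(generated, context_answer, parametric_answer):
    # Normalise for a loose containment check.
    gen = generated.strip().lower()
    if context_answer.strip().lower() in gen:
        return "use_context"      # the model followed the in-context knowledge
    if parametric_answer.strip().lower() in gen:
        return "use_parameters"   # the model followed its memorised knowledge
    return "other"                # agrees with neither source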

Save the activations of grouped prompts:

bash ./scripts/run_save_grouped_activations.sh
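
This step amounts to caching the residual-stream hidden states at the chosen layers while the grouped prompts are run. A minimal sketch with PyTorch forward hooks on a LLaMA-style model (the layer access path and last-token storage are assumptions about the actual script):

import torch

@torch.no_grad()
def save_grouped_activations(model, input_ids, layer_ids):
    cached, hooks = {}, []
    for i in layer_ids:
        def hook(module, args, output, i=i):
            # output[0] is the residual-stream hidden state, shape
            # (batch, seq_len, hidden_dim); keep the final token's state.
            cached[i] = output[0][:, -1, :].detach().cpu()
        hooks.append(model.model.layers[i].register_forward_hook(hook))
    model(input_ids)
    for h in hooks:
        h.remove()
    return cached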

Estimate the mutual information and expectations for each SAE activation:

bash ./scripts/run_mutual_information_and_expectations.sh
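
One straightforward estimator, sketched here for illustration (not necessarily the paper's exact one): binarise each SAE activation into active/inactive, compute its mutual information with the binary behaviour label over the grouped prompts, and record the conditional expectation of each activation within each group:

import numpy as np
from sklearn.metrics import mutual_info_score

def activation_statistics(sae_acts, labels):
    # sae_acts: (num_prompts, num_sae_features) SAE activation values
    # labels:   (num_prompts,) with 0 = use_parameters, 1 = use_context
    active = (sae_acts > 0).astype(int)
    mi = np.array([mutual_info_score(labels, active[:, j])
                   for j in range(active.shape[1])])
    # Group-conditional expectations of each activation.
    exp_param = sae_acts[labels == 0].mean(axis=0)
    exp_context = sae_acts[labels == 1].mean(axis=0)
    return mi, exp_param, exp_context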

Evaluate SpARE

python ./scripts/run_spare.py \
  --model_path="meta-llama/Llama-2-7b-hf" \
  --data_name="nqswap" \
  --layer_ids 12 13 14 15 \
  --edit_degree=2.0 \
  --select_topk_proportion=0.07 \
  --seed=42 \
  --hiddens_name="grouped_activations" \
  --mutual_information_save_name="mutual_information" \
  --run_use_parameter \
  --run_use_context
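
The flags map onto a residual-stream edit at inference time: the top select_topk_proportion of SAE activations ranked by mutual information are kept, their group-conditional expectations are decoded back through the SAE, and the hidden state is steered towards the target behaviour and away from the competing one, scaled by edit_degree. A hedged sketch of that edit (the real implementation is in scripts/run_spare.py; the sae.decode call assumes a SAELens-style decoder):

import torch

def spare_edit(hidden, sae, mi, exp_target, exp_other,
               topk_proportion, edit_degree):
    # hidden: residual-stream state at one layer;
    # mi, exp_target, exp_other: per-SAE-feature statistics.
    k = max(1, int(topk_proportion * mi.numel()))
    mask = torch.zeros_like(exp_target)
    mask[torch.topk(mi, k).indices] = 1.0
    # Decode the masked conditional expectations back to the residual
    # stream: add the target behaviour's features, remove the other's.
    steer = sae.decode(exp_target * mask) - sae.decode(exp_other * mask)
    return hidden + edit_degree * steer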

Acknowledgement

The implementation of the sparse autoencoder is adapted from EleutherAI/sae (https://github.com/EleutherAI/sae) and jbloomAus/SAELens (https://github.com/jbloomAus/SAELens). We appreciate their open-source contributions!

Citing

Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

@misc{zhao2024steeringknowledgeselectionbehaviours,
      title={Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering}, 
      author={Yu Zhao and Alessio Devoto and Giwon Hong and Xiaotang Du and Aryo Pradipta Gema and Hongru Wang and Xuanli He and Kam-Fai Wong and Pasquale Minervini},
      year={2024},
      eprint={2410.15999},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.15999}, 
}

The preliminary study: Analysing the Residual Stream of Language Models Under Knowledge Conflicts

@misc{zhao2024analysingresidualstreamlanguage,
      title={Analysing the Residual Stream of Language Models Under Knowledge Conflicts}, 
      author={Yu Zhao and Xiaotang Du and Giwon Hong and Aryo Pradipta Gema and Alessio Devoto and Hongru Wang and Xuanli He and Kam-Fai Wong and Pasquale Minervini},
      year={2024},
      eprint={2410.16090},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.16090}, 
}
