Official Repository for "Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small"

Access our paper on arXiv.

🔍 About

We evaluate open-source Sparse Autoencoders (SAEs) for GPT-2 small released by different organisations, specifically OpenAI, Apollo Research, and Joseph Bloom, on the RAVEL dataset. We compare them against neurons and Distributed Alignment Search (DAS) based on how well each disentangles a concept, whether in the neuron basis or in a latent space.

📊 Result

The graphs below show the performance:

⚙️ Setup

🔴 NOTE: The run.sh file lists the files to be run and should be edited to target a particular layer. The arguments in the shell script map to the arguments defined in the code. It is advisable to create a new environment before running any of the files.

First clone the repository:

git clone https://github.com/MaheepChaudhary/SAE-Ravel.git

To download the SAEs and set up the environment, run:

chmod +x setup.sh run.sh eval_run.sh
./setup.sh

We ran the evaluation on 6 SAEs from Apollo Research. Each one can be downloaded by changing the wandb ID inside the code. The IDs of the 6 SAEs are:

Layer 1 e2e SAE: bst0prdd
Layer 1 e2e+ds SAE: e26jflpq
Layer 5 e2e SAE: tvj2owza
Layer 5 e2e+ds SAE: 2lzle2f0
Layer 9 e2e SAE: vnfh4vpi
Layer 9 e2e+ds SAE: u50mksr8
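
Since switching SAEs means editing a wandb ID by hand, it may help to keep the six IDs in one lookup table. The sketch below only restates the list above. The names `APOLLO_SAE_RUN_IDS` and `apollo_run_id` are hypothetical and do not come from the repository, and the snippet does not download anything.

```python
# Wandb run IDs copied from the list above, keyed by (layer, SAE type).
# The dictionary and helper names are illustrative, not part of the repository.
APOLLO_SAE_RUN_IDS = {
    (1, "e2e"): "bst0prdd",
    (1, "e2e+ds"): "e26jflpq",
    (5, "e2e"): "tvj2owza",
    (5, "e2e+ds"): "2lzle2f0",
    (9, "e2e"): "vnfh4vpi",
    (9, "e2e+ds"): "u50mksr8",
}


def apollo_run_id(layer: int, sae_type: str) -> str:
    """Return the wandb run ID for the given layer and SAE type."""
    try:
        return APOLLO_SAE_RUN_IDS[(layer, sae_type)]
    except KeyError:
        valid = ", ".join(f"layer {l} {t}" for l, t in sorted(APOLLO_SAE_RUN_IDS))
        raise ValueError(
            f"No SAE listed for layer {layer} {sae_type!r}; choose from: {valid}"
        ) from None


if __name__ == "__main__":
    print(apollo_run_id(5, "e2e+ds"))  # prints 2lzle2f0
```

The returned string is what would be substituted for the wandb ID in the code. Where that ID lives in the repository is not stated here, so check the source files before editing.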

🏋️ Training

To train the mask for the models or DAS, run:

./run.sh
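
This README does not spell out what the trained mask does. As a rough illustration only, the sketch below assumes the mask selects which features of a representation are swapped from a source input into a base input (an interchange intervention). The function name `interchange` and the plain-list representation are assumptions made for this example. The actual implementation in models.py may differ, for instance by learning a continuous mask over tensors.

```python
from typing import List, Sequence


def interchange(
    base: Sequence[float], source: Sequence[float], mask: Sequence[float]
) -> List[float]:
    """Blend two feature vectors: mask=1 takes the source value, mask=0 keeps the base.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if not (len(base) == len(source) == len(mask)):
        raise ValueError("base, source, and mask must have the same length")
    return [m * s + (1.0 - m) * b for b, s, m in zip(base, source, mask)]


if __name__ == "__main__":
    base = [1.0, 2.0, 3.0]
    source = [10.0, 20.0, 30.0]
    mask = [0.0, 1.0, 0.0]  # swap only the second feature
    print(interchange(base, source, mask))  # prints [1.0, 20.0, 3.0]
```

Under this assumption, a mask that disentangles a concept well would change only the model's prediction for that concept when the selected features are swapped.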

📈 Evaluation

To evaluate the quality of the SAEs in terms of loss and accuracy, run:

./eval_run.sh

📂 Directory Structure

Starting with the folders: the ./data/ folder contains all the prepared data and the .py files used to prepare it. The ./figure/ folder contains all the related images. The ./saved_models/ folder is a placeholder where models are stored when they are saved.

The individual files serve the following purposes:

imports.py: imports all the required libraries and modules.
models.py: contains the model-preparation code where the intervention is performed, as well as the code for evaluating the SAEs.
main.py: runs the code in models.py to train the mask for each model and for DAS while performing the intervention.
eval_sae.py: runs the evaluation function in models.py.
visualisation.py: contains the code for creating the graphs.
setup.sh: sets up the environment and downloads the required SAEs.
run.sh: runs the training files.
eval_run.sh: runs the SAE evaluation files.
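
To check that a clone matches the layout described above before running anything, a small script along these lines could be used. It is a sketch that relies only on the file and folder names listed in this section. The helper `missing_entries` is hypothetical and is not shipped with the repository.

```python
from pathlib import Path

# Names taken from the directory description above.
EXPECTED_FILES = [
    "imports.py",
    "models.py",
    "main.py",
    "eval_sae.py",
    "visualisation.py",
    "setup.sh",
    "run.sh",
    "eval_run.sh",
]
EXPECTED_DIRS = ["data", "figure", "saved_models"]


def missing_entries(repo_root) -> list:
    """Return the expected files and folders that are absent under repo_root."""
    root = Path(repo_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Run from the repository root; an empty list means nothing is missing.
    print(missing_entries("."))
```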

📚 Citation

If you find this repository useful in your research, please consider citing our paper:

@misc{chaudhary2024evaluatingopensourcesparseautoencodersongpt2small,
      title={Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small}, 
      author={Maheep Chaudhary and Atticus Geiger},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={}, 
}
