Cool work! Do you/any others have plans to train and open source some sparse autoencoders on some open source LLMs? (E.g. smaller Pythia or GPT-2 models.) Seems like a cool resource to exist, and one that might help enable some of this work.
Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m + Pythia-410m), and probably also GPT2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy dictionary training solutions.
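For anyone wanting a feel for what that training involves, here's a minimal sketch of a sparse-autoencoder / dictionary training step, assuming a PyTorch setup; the class, hyperparameters, and data handling are illustrative only, not the actual API of the linked codebase:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_model -> d_dict -> d_model, trained with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(feats)           # reconstruction of the input activations
        return recon, feats

def train_step(sae, opt, acts, l1_coeff=1e-3):
    # acts: [batch, d_model] activations collected from the LM (e.g. the residual stream)
    recon, feats = sae(acts)
    recon_loss = (recon - acts).pow(2).mean()     # reconstruction term
    sparsity_loss = feats.abs().mean()            # L1 penalty encouraging sparse feature use
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```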
Awesome! On the residual stream, or also on the MLP/attention outputs? I think both would be great if you have the resources; I expect there's a lot of interest in both, and in how they interact. (IMO the Anthropic paper's training on MLP activations is equivalent to training on the MLP layer output, just with 4x the parameters.) Ideally, if you're doing it on attn_out, you could instead do it on the mixed value (z in TransformerLens), which has the same dimensionality but makes it super clear which head the dictionary is looking at, and is robust to head superposition.
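To illustrate the z suggestion: in TransformerLens the per-head mixed values live at hook_z with shape [batch, pos, n_heads, d_head], so flattening the head dimensions gives activations the same size as attn_out while keeping head membership explicit. A rough sketch (the model and layer are just for illustration):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m-deduped")
layer = 3  # illustrative layer choice

_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")
z = cache["z", layer]                   # [batch, pos, n_heads, d_head] mixed values per head
acts = z.reshape(*z.shape[:2], -1)      # [batch, pos, n_heads * d_head], same size as attn_out

# Train the dictionary on `acts`; each contiguous d_head slice of a feature's decoder
# vector then corresponds to a single attention head, so head attribution is immediate.
```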
Having a clean codebase also seems like a really useful resource, esp if you implement some of the tricks from the Anthropic paper, like neuron resampling. Looking forward to it!
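For anyone unfamiliar, the resampling trick is roughly: periodically find dictionary features that have stopped firing and re-initialize them towards inputs the autoencoder currently reconstructs badly. A very rough sketch of that idea, assuming an SAE with encoder/decoder linear layers as in the sketch above (this is not Anthropic's exact procedure, which also rescales weight norms and resets optimizer state):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def resample_dead_features(sae, acts, fire_counts):
    # fire_counts: [d_dict] how often each feature has fired over the last N steps
    dead = (fire_counts == 0).nonzero(as_tuple=True)[0]
    if len(dead) == 0:
        return

    recon, _ = sae(acts)
    losses = (recon - acts).pow(2).sum(dim=-1)          # per-example reconstruction loss
    idx = torch.multinomial(losses / losses.sum(), len(dead), replacement=True)

    # Point the dead features' dictionary directions at badly-reconstructed inputs.
    new_dirs = F.normalize(acts[idx], dim=-1)           # [len(dead), d_model]
    sae.decoder.weight[:, dead] = new_dirs.T
    sae.encoder.weight[dead] = new_dirs
    sae.encoder.bias[dead] = 0.0
```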
I actually do have some dictionaries publicly hosted already, though only on the residual stream, along with some simple training code.
I want to integrate some basic visualizations (and include Anthropic's tricks) before making a public post on it, but currently:
Dict on Pythia-70m-deduped
Dict on Pythia-410m-deduped
Which can be downloaded & interpreted with this notebook
With easy training code for bespoke models here.
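As a hypothetical sketch of using one of these residual-stream dictionaries (the loading step and the [d_dict, d_model] layout are assumptions on my part, so check the notebook for the real format):

```python
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m-deduped")
dictionary = torch.randn(2048, model.cfg.d_model)   # placeholder; load the released weights instead

_, cache = model.run_with_cache("The quick brown fox jumps over the lazy dog")
resid = cache["resid_post", 3]                      # [batch, pos, d_model] residual stream after layer 3

# Crude decomposition: project activations onto the normalized dictionary directions.
dirs = F.normalize(dictionary, dim=-1)
coeffs = resid @ dirs.T                             # [batch, pos, d_dict] feature coefficients
top_features = coeffs[0, -1].topk(5).indices        # most active features at the final token
print(top_features)
```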