Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m + Pythia-410m), and probably also GPT2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy dictionary training solutions.
Awesome! On the residual stream, or also on the MLP/attention outputs? I think both would be great if you have the resources; I expect there’s a lot of interest in both and in how they interact. (IMO the Anthropic paper training on MLP activations is equivalent to training on the MLP layer output, just with 4x the parameters.) Ideally, if you’re doing it on attn_out, you could instead do it on the mixed value (z in TransformerLens), which has the same dimensionality but makes it super clear which head the dictionary is looking at, and is robust to head superposition.
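For concreteness, here's a minimal sketch (not anyone's actual pipeline) of pulling the per-head mixed values out of TransformerLens to train a dictionary on; the model and layer choices are just illustrative:

```python
# Minimal sketch: collect the "z" (mixed value) activations from TransformerLens
# instead of attn_out. Model name and layer are illustrative, not prescriptive.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m-deduped")
tokens = model.to_tokens("An example prompt for collecting activations")

_, cache = model.run_with_cache(tokens)

layer = 3  # arbitrary layer, just for illustration
z = cache[f"blocks.{layer}.attn.hook_z"]  # shape: [batch, pos, n_heads, d_head]

# Flattening the head dimension gives a vector with the same dimensionality as
# attn_out (n_heads * d_head == d_model here), but each contiguous slice of the
# dictionary's input now corresponds to exactly one head.
z_flat = z.reshape(z.shape[0], z.shape[1], -1)  # [batch, pos, n_heads * d_head]
print(z_flat.shape)
```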
Having a clean codebase also seems like a really useful resource, esp if you implement some of the tricks from the Anthropic paper, like neuron resampling. Looking forward to it!
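(For anyone unfamiliar with the resampling trick: the rough idea is to periodically find dictionary features that never fire and re-initialize them. A simplified sketch below, with made-up variable names; Anthropic's full procedure also re-initializes toward high-loss inputs and resets optimizer state, which this skips.)

```python
# Simplified sketch of the neuron-resampling idea: detect dead dictionary
# features over a large batch of feature activations and re-initialize them.
import torch

@torch.no_grad()
def resample_dead_features(encoder_weight, decoder_weight, feature_acts, eps=1e-8):
    """encoder_weight: [n_features, d_model]; decoder_weight: [d_model, n_features];
    feature_acts: [n_samples, n_features] feature activations over a large batch."""
    dead = feature_acts.abs().max(dim=0).values < eps  # features that never fired
    n_dead = int(dead.sum())
    if n_dead == 0:
        return 0
    d_model = encoder_weight.shape[1]
    # Re-initialize dead features with small random unit directions.
    new_dirs = torch.randn(n_dead, d_model)
    new_dirs = new_dirs / new_dirs.norm(dim=1, keepdim=True)
    encoder_weight[dead] = 0.2 * new_dirs
    decoder_weight[:, dead] = new_dirs.T
    return n_dead
```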
I actually do have some dictionaries publicly hosted, though only on the residual stream, along with some simple training code.
I want to integrate some basic visualizations (and include Anthropic’s tricks) before making a public post on it, but currently:
Dict on pythia-70m-deduped
Dict on Pythia-410m-deduped
Which can be downloaded & interpreted with this notebook
With easy training code for bespoke models here.
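(To give a flavor of what's being trained, not the linked code itself: these dictionaries are sparse autoencoders that reconstruct activations through an overcomplete ReLU code with an L1 sparsity penalty. A minimal sketch, with illustrative sizes and L1 coefficient:)

```python
# Minimal sketch of a sparse dictionary / autoencoder on model activations.
# d_activation, d_dict, and l1_coeff are placeholder values for illustration.
import torch
import torch.nn as nn

class SparseDictionary(nn.Module):
    def __init__(self, d_activation=512, d_dict=2048):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(codes)          # reconstructed activations
        return recon, codes

def train_step(model, opt, acts, l1_coeff=1e-3):
    recon, codes = model(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

dictionary = SparseDictionary()
opt = torch.optim.Adam(dictionary.parameters(), lr=1e-4)
fake_acts = torch.randn(64, 512)  # stand-in for real residual-stream activations
print(train_step(dictionary, opt, fake_acts))
```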