Some updates about the dictionary_learning repo:

The repo now supports ghost grads. H/t g-w1 for submitting a PR for this.
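For a sense of how this fits into training, here's a minimal sketch. The import path and argument names (especially the ghost-grads option) are my assumptions rather than the repo's exact interface, so check training.py for the real signature:

import torch
# Import path and argument names below are assumptions about the repo's API,
# not its exact interface -- check training.py in the repo.
from dictionary_learning.training import trainSAE

# Stand-in for an ActivationBuffer: any iterator over activation batches should work here.
activations = (torch.randn(256, 512) for _ in range(1000))

ae = trainSAE(
    activations,
    activation_dim=512,        # dimension of the activations being dictionary-learned
    dictionary_size=16 * 512,  # number of dictionary features
    lr=3e-4,
    ghost_threshold=2000,      # (assumed name) steps a feature must be dead before ghost grads apply
)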
ActivationBuffers now work natively with model components—like the residual stream—whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
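To illustrate what "take the first component, and do so again for nested tuples" means, here's a small standalone helper (mine, not the repo's code) that mirrors the unwrapping behavior:

import torch

def first_tensor(x):
    # Recursively unwrap (possibly nested) tuples by taking the first element,
    # mirroring how the buffer extracts activations from tuple-valued components.
    while isinstance(x, tuple):
        x = x[0]
    return x

resid = torch.randn(4, 512)
assert first_tensor(((resid, None), None)) is resid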
ActivationBuffers can now be stored on the GPU.
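In practice this should look roughly like passing a device when constructing the buffer; the constructor arguments below are my assumptions, so see ActivationBuffer in the repo for the real ones:

# Sketch of keeping the buffer's stored activations on the GPU.
# Constructor arguments are assumptions -- see ActivationBuffer in the repo.
from nnsight import LanguageModel
from dictionary_learning.buffer import ActivationBuffer

model = LanguageModel("EleutherAI/pythia-70m-deduped", device_map="cuda")
submodule = model.gpt_neox.layers[3].mlp   # component whose activations we collect
text_data = ["dictionary learning is fun"] * 10_000

buffer = ActivationBuffer(
    iter(text_data),   # iterator over text examples
    model,
    submodule,
    out_feats=512,     # dimension of the collected activations
    device="cuda",     # store the buffered activations on the GPU
)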
The file evaluation.py contains code for evaluating trained dictionaries. I’ve found this pretty useful for quickly evaluating dictionaries people send to me.
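Hypothetical usage, with the import path, function name, and arguments all being my guesses at the interface rather than the actual one (check evaluation.py):

# ae and buffer as in the sketches above: a trained dictionary and a held-out ActivationBuffer.
from dictionary_learning.evaluation import evaluate

metrics = evaluate(ae, buffer)
print(metrics)  # e.g. reconstruction MSE, L0/L1 sparsity, fraction of model loss recovered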
New convenience: you can do reconstructed_acts, features = dictionary(acts, output_features=True) to get both the reconstruction and the features computed by dictionary.
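Concretely, with illustrative shapes (the AutoEncoder constructor here is my guess at the repo's interface):

import torch
from dictionary_learning.dictionary import AutoEncoder  # import path is an assumption

dictionary = AutoEncoder(activation_dim=512, dict_size=16 * 512)  # illustrative sizes
acts = torch.randn(64, 512)  # a batch of activations
reconstructed_acts, features = dictionary(acts, output_features=True)
# reconstructed_acts matches acts' shape; features has shape (64, dict_size)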
Also, if you’d like to train dictionaries for many model components in parallel, you can use the parallel branch. I can’t promise I won’t make breaking changes to the parallel branch, sorry.
Finally, we’ve released a new set of dictionaries for the MLP outputs, attention outputs, and residual stream in all layers of Pythia-70m-deduped. The MLP and attention dictionaries seem pretty good, and the residual stream dictionaries seem like a mixed bag. Their stats can be found here.
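If it helps, here is a highly speculative sketch of loading one of these dictionaries, assuming they ship as PyTorch state dicts compatible with the repo's AutoEncoder class; the path and dimensions are placeholders, not the actual release layout:

import torch
from dictionary_learning.dictionary import AutoEncoder

ae = AutoEncoder(activation_dim=512, dict_size=32_768)  # placeholder sizes
state_dict = torch.load("path/to/pythia-70m-deduped/mlp_out_layer3/ae.pt")
ae.load_state_dict(state_dict)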