Oh yeah, this is great, thanks! For people reading this, I’ll highlight SLT + developmental interp + Mamba as areas which I think are large enough to have specific exercise sections but currently don’t.
CallumMcDougall
AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
How ARENA course material gets made
Thanks!! Really appreciate it
Thanks so much! (-:
A Selection of Randomly Selected SAE Features
Thanks so much, really glad to hear it’s been helpful!
SAE-VIS: Announcement Post
Thanks, really appreciate this (and the advice for later posts!)
Mech Interp Challenge: January—Deciphering the Caesar Cipher Model
Yep, definitely! If you’re using MSE loss then it’s pretty straightforward to use backprop to see how importance relates to the loss function. Also, if you’re interested, I think Redwood’s paper on capacity (which is the same as what Anthropic calls dimensionality) looks at the derivative of the loss wrt the capacity assigned to a given feature.
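To make the backprop point concrete, here is a minimal sketch assuming the importance-weighted squared-error loss from the toy models setup, i.e. loss = sum_i I_i * (x_i - x_hat_i)^2. The function names and toy values are hypothetical, and the gradient is written out by hand rather than taken from an autograd library, so treat it as an illustration and not the ARENA implementation.

```python
def weighted_mse(x, x_hat, importance):
    """Importance-weighted squared error, summed over features."""
    return sum(i * (a - b) ** 2 for i, a, b in zip(importance, x, x_hat))


def grad_wrt_reconstruction(x, x_hat, importance):
    """Analytic gradient of the loss with respect to each reconstructed feature."""
    return [-2.0 * i * (a - b) for i, a, b in zip(importance, x, x_hat)]


# Hypothetical toy values: two features with the same reconstruction error
# but different importance.
x = [1.0, 1.0]
x_hat = [0.5, 0.5]
importance = [1.0, 0.1]

loss = weighted_mse(x, x_hat, importance)
grads = grad_wrt_reconstruction(x, x_hat, importance)

# Each feature's gradient scales linearly with its importance, so with equal
# errors the more important feature gets a proportionally larger update.
print(loss, grads)
```

Under this loss, importance acts as a per-feature multiplier on the gradient signal, which is one way to see why lower-importance features get less pressure to be represented.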
Thanks (-:
Good question! In the first batch of exercises (replicating Toy Models of Superposition), we play around with different importances. There are some interesting findings here (e.g. when you decrease sparsity to the point where you no longer represent all features, it’s usually the lower-importance features which collapse first). I chose not to have the SAE exercises use varying importance, although it would be interesting to play around with this and see what you get!
As for what importance represents, it’s basically a proxy for “how much a certain feature reduces loss, when it actually is present.” This can be independent from feature probability. Anthropic included it in their toy models paper in order to make those models truer to reality, in the hope that the setup could tell us more interesting lessons about actual models. From the TMS paper:
Not all features are equally useful to a given task. Some can reduce the loss more than others. For an ImageNet model, where classifying different species of dogs is a central task, a floppy ear detector might be one of the most important features it can have. In contrast, another feature might only very slightly improve performance.
If we’re talking features in language models, then importance would be “average amount that this feature reduces cross entropy loss”. I open-sourced an SAE visualiser which you can find here. You can navigate through it and look at the effect of features on loss. It doesn’t actually show the “overall importance” of a feature, but you should be able to get an idea of the kinds of situations where a feature is super loss-reducing and when it isn’t. Example of a highly loss-reducing feature: feature #8, which fires on Django syntax and strongly predicts the “django” token. This seems highly loss-reducing because (although sparse) it’s very often correct when it fires with high magnitude. On the other hand, feature #7 seems less loss-reducing, because a lot of the time it’s pushing for something incorrect (maybe there exist other features which balance it out).
Interpretability with Sparse Autoencoders (Colab exercises)
Winner = highest-quality solution over the time period of a month (solutions get posted at the start of the next month, along with a new problem).
Note that we’re slightly de-emphasising the competition side now that there are occasional hints which get dropped during the month in the Slack group. I’ll still credit the best solution in the Slack group & next LW post, but the choice to drop hints was to make the problem more accessible and hopefully increase the overall reach of this series.
I don’t know of specific examples, but this is the image I have in my head when thinking about why untied weights are more free than tied weights:
More generally, I think this is why studying SAEs in the TMS setup can be a bit challenging: there’s often too much symmetry and not enough complexity for untied weights to be useful, meaning that just forcing your weights to be tied can fix a lot of problems! (We include it in ARENA mostly to illustrate key concepts, not because it gets you many super informative results.) But I’m keen for more work like this, trying to understand feature absorption better in more tractable cases.
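For anyone who wants the tied vs untied distinction spelled out, here is a rough sketch in plain Python. It assumes a bare-bones SAE with a ReLU encoder, no biases, and one decoder direction per latent; the function names and weight values are made up for illustration and are not the ARENA code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sae_forward(x, W_dec, W_enc=None):
    """Minimal SAE forward pass (no biases).

    W_dec: one decoder direction per latent, each of length d_in.
    W_enc: one encoder direction per latent. If None, the weights are tied,
           i.e. each latent reads from the same direction it writes to.
    """
    if W_enc is None:
        W_enc = W_dec  # tied weights: encoder direction == decoder direction
    acts = [max(0.0, dot(w, x)) for w in W_enc]  # ReLU encoder
    d_in = len(W_dec[0])
    x_hat = [sum(a * w[k] for a, w in zip(acts, W_dec)) for k in range(d_in)]
    return acts, x_hat


# Hypothetical toy weights: two latents in a 2D input space.
W_dec = [[1.0, 0.0], [0.0, 1.0]]
W_enc_untied = [[1.0, -0.5], [-0.5, 1.0]]  # free to differ from W_dec
x = [1.0, 0.5]

tied_acts, tied_x_hat = sae_forward(x, W_dec)
untied_acts, untied_x_hat = sae_forward(x, W_dec, W_enc_untied)
print(tied_acts, tied_x_hat)
print(untied_acts, untied_x_hat)
```

The only difference is that the untied version has an extra set of parameters for where each latent reads from. That extra freedom is what I mean by untied weights being "more free", and in a very symmetric toy setup it can be more freedom than the problem needs.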