In the toy datasets, the features all have the same scale (coefficients drawn uniformly from zero to one when active, multiplied by a unit vector). In the NN case, however, there's no particular reason to think the feature scales are normalized much (though maybe they're normalized somewhat due to weight decay and the like). Is there some reason this is okay?
Hm, it's a great point. There's no principled reason for it. Equivalently, there's no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a 'feature coefficient magnitude decay' to create features that don't all live on the same scale. Thanks!
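For concreteness, here's a minimal sketch of what such a decay could look like (the geometric decay form and the parameter names are illustrative assumptions, not a settled design):

```python
import numpy as np

def sample_feature_coefficients(n_samples, n_features, p_active=0.01, decay=0.99, rng=None):
    """Sample sparse feature coefficients whose typical magnitude decays with feature index.

    Each feature is independently active with probability p_active; when active, its
    coefficient is Uniform(0, 1) scaled by decay**feature_index, so later features live
    on smaller scales. The geometric decay is just one illustrative choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    active = rng.random((n_samples, n_features)) < p_active
    magnitudes = rng.random((n_samples, n_features))
    scales = decay ** np.arange(n_features)  # per-feature scale, decaying with index
    return active * magnitudes * scales
```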
E.g., learn a low-rank autoencoder as in the toy models paper and then learn to extract features from that representation? I don't see a particular reason why you used a hand-derived superposition representation (which seems less realistic to me?).
One reason for this is that the polytopic features learned by the model in the Toy Models of Superposition paper can be thought of as approximately maximally distant points on a hypersphere (to my intuitions, at least). With high-ish numbers of dimensions, as in our toy data (256), choosing points randomly on the hypersphere achieves approximately the same thing. By choosing points randomly, as we did here, we don't have to train another, potentially very large, matrix that puts the one-hot features into superposition. The data generation method seemed like it would approximate real features about as well as polytope-like encodings of one-hot features (which are also unrealistic), so the small benefits didn't seem worth the moderate computational costs. But I could be convinced otherwise if I've missed some important benefits.
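For concreteness, here's a minimal sketch of the random-direction data generation described above (the dimension, sparsity level, and names are illustrative assumptions, not our exact code):

```python
import numpy as np

def generate_toy_batch(n_samples, n_features, d_model=256, p_active=0.01, rng=None):
    """Generate toy data by superposing sparse features along random hypersphere directions.

    Each feature gets a fixed random unit direction in d_model dimensions; in high
    dimensions these are nearly orthogonal, approximating the well-separated polytope
    arrangements a learned embedding would give, without training an extra matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    # One random unit direction per feature (rows of shape (n_features, d_model)).
    directions = rng.standard_normal((n_features, d_model))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Sparse coefficients: Uniform(0, 1) when active, zero otherwise.
    active = rng.random((n_samples, n_features)) < p_active
    coeffs = active * rng.random((n_samples, n_features))
    # Superpose: each datapoint is a weighted sum of its active feature directions.
    return coeffs @ directions
```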
Beyond this, I imagine it would be nicer if you trained a model to do computation in superposition and then tried to decode the representations the model uses; you should still be able to know what the 'real' features are (I think).
Nice idea! This could be a good middle ground between toy-data experiments and language model experiments. We'll look into this; thanks again!