Nice! There’s definitely been this feeling with training SAEs that activation penalty + reconstruction loss is “not actually asking the computer for what we want,” leading to fragility. TopK seems like it’s a step closer to the ideal—did you subjectively feel confident when starting off large training runs?
Confused about section 5.3.1:
To mitigate this issue, we sum multiple TopK losses with different values of k (Multi-TopK). For example, using L(k) + L(4k)/8 is enough to obtain a progressive code over all k′ (note however that training with Multi-TopK does slightly worse than TopK at k). Training with the baseline ReLU only gives a progressive code up to a value that corresponds to using all positive latents.
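For concreteness, here is a minimal sketch of what that Multi-TopK objective could look like in PyTorch. The class, names, and shapes are illustrative only, and it omits details of the paper's exact parameterization (e.g. the pre-encoder bias); it just shows a TopK autoencoder whose training loss is the reconstruction error at k plus the reconstruction error at 4k downweighted by 1/8.

```python
# Illustrative sketch of a TopK SAE trained with the Multi-TopK loss L(k) + L(4k)/8.
# All names here are hypothetical, not the paper's actual code.
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def encode(self, x: torch.Tensor, k: int) -> torch.Tensor:
        pre = self.encoder(x)
        # Keep only the k largest pre-activations, zero out the rest,
        # then apply a ReLU so the kept values are guaranteed non-negative.
        top = torch.topk(pre, k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, top.indices, top.values)
        return torch.relu(latents)

    def forward(self, x: torch.Tensor, k: int | None = None) -> torch.Tensor:
        return self.decoder(self.encode(x, k or self.k))


def multi_topk_loss(sae: TopKSAE, x: torch.Tensor) -> torch.Tensor:
    # L(k): reconstruction MSE using the top-k latents.
    loss_k = (sae(x, sae.k) - x).pow(2).mean()
    # L(4k): the same reconstruction loss with 4k latents, downweighted by 1/8.
    loss_4k = (sae(x, 4 * sae.k) - x).pow(2).mean()
    return loss_k + loss_4k / 8
```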
Why would we want a progressive code over all hidden activations? If features have different meanings when they’re positive versus when they’re negative (imagining a sort of Toy Models of Superposition picture, where features are a bunch of rays squeezed in around a central point), then it seems like something weird is going on if your negative hidden activations are informative.
We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.
One reason we might want a progressive code is it would basically let you train one autoencoder and use it for any k you wanted to at test time (which is nice because we don’t really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders.
Also, even with a progressive code, the activations on the margin would not generally be negative (we actually apply a ReLU to make sure that the activations are definitely non-negative, but almost always the (k+1)-th value is still substantially positive).
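Continuing the illustrative sketch from above, here is roughly what those two points look like in code: a model trained with the Multi-TopK loss can simply be evaluated at whatever test-time k you like, and the pre-activation just past the cutoff can be inspected directly. The batch, shapes, and printed numbers below are stand-ins, not results.

```python
# Suppose `sae` has been trained with multi_topk_loss from the sketch above.
import torch

x = torch.randn(256, 768)                          # stand-in batch of activations
sae = TopKSAE(d_model=768, n_latents=2**15, k=32)  # assume this has been trained

# With a progressive code, the same autoencoder can be evaluated at different
# test-time sparsity levels, without retraining for each k.
for test_k in (16, 32, 64, 128):
    mse = (sae(x, test_k) - x).pow(2).mean().item()
    print(f"k={test_k}: reconstruction MSE {mse:.4f}")

# Inspect the value just past the cutoff: in practice the (k+1)-th largest
# pre-activation is almost always still substantially positive.
kth_plus_one = sae.encoder(x).topk(sae.k + 1, dim=-1).values[..., -1]
print("fraction with positive (k+1)-th value:",
      (kth_plus_one > 0).float().mean().item())
```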