This is nice work! I’m most interested in this for reinitializing dead features. I expect you could reinit using the datapoints the model is currently worst at predicting over the last N batches, or something like that.
I don’t think we’re bottlenecked on compute here actually.
If dictionaries applied to real models get to ~0 reconstruction loss, we can pay the compute cost to train lots of them for lots of models and open-source them for others to study.
I believe doing awesome work with sparse autoencoders (e.g. finding a truth direction, understanding RLHF) will convince others to work on it as well, including on lowering the compute cost. I predict that convincing 100 people to work on this 1 month sooner would be more impactful than lowering the compute cost (though again, this work is also quite useful for reinitialization!).
I’m pretty concerned about the compute scaling of autoencoders to real models. I predict that both the data needed and the number of features scale superlinearly in d_model, which seems to scale badly to frontier models.
This doesn’t engage with (2): doing awesome work to attract more researchers to this agenda is counterfactually more useful than directly working on lowering the compute cost now (since others, or you yourself, can work on that compute bottleneck later).
Though honestly, if the results ended up giving a ~2x speedup, that’d be quite useful for faster feedback loops for me.
Yeah, I agree that doing work that gets other people excited about sparse autoencoders is arguably more impactful than marginal compute savings; I’m just arguing that compute savings do matter.
Thanks Logan,
1) About re-initialization:
I think your idea of re-initializing the dead features of the sparse dictionary with the input data the model struggles to reconstruct could work. It seems like a great idea!
This probably implies extracting rare feature vectors out of such datapoints before using them for initialization.
I intuitively suspect that the datapoints the model is bad at predicting contain rare features, and potentially rare features that several of those datapoints have in common. Therefore I would bet on performing some rare-feature extraction over batches of poorly reconstructed input data, instead of directly using the datapoints with the worst reconstruction loss. (But maybe this is what you already had in mind?)
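For concreteness, here is a minimal sketch of the kind of re-initialization I am imagining. It is purely illustrative: it assumes a standard SAE with `W_enc`/`W_dec`/`b_enc` parameters whose forward pass returns the reconstruction, and it uses the residual directions of the worst-reconstructed datapoints as the "rare" candidates (all names are made up, not from any particular codebase):

```python
import torch

@torch.no_grad()
def reinit_dead_features(sae, acts, fire_counts, n_candidates=4096):
    """Re-initialize dead dictionary features from poorly reconstructed datapoints.

    sae:         assumed sparse autoencoder with W_enc (d_model, n_feats),
                 W_dec (n_feats, d_model), b_enc (n_feats,); sae(x) returns x_hat.
    acts:        (N, d_model) activations collected over the last N batches.
    fire_counts: (n_feats,) how often each feature fired over those batches.
    """
    dead = (fire_counts == 0).nonzero(as_tuple=True)[0]
    if dead.numel() == 0:
        return

    # Score datapoints by reconstruction error under the current dictionary.
    recon = sae(acts)
    err = (acts - recon).pow(2).sum(-1)
    worst = err.topk(min(n_candidates, err.numel())).indices

    # "Rare feature extraction": take the residual directions (what the current
    # features miss) of the worst-reconstructed datapoints, not the raw activations.
    residuals = (acts - recon)[worst]
    residuals = residuals / residuals.norm(dim=-1, keepdim=True)

    # Give each dead feature its own candidate direction, without replacement
    # when possible, so several dead features don't get the same direction.
    n_dead = dead.numel()
    if n_dead <= residuals.shape[0]:
        idx = torch.randperm(residuals.shape[0])[:n_dead]
    else:  # more dead features than candidates: sample with replacement
        idx = torch.randint(residuals.shape[0], (n_dead,))
    new_dirs = residuals[idx]

    sae.W_dec.data[dead] = new_dirs
    sae.W_enc.data[:, dead] = new_dirs.T
    sae.b_enc.data[dead] = 0.0
```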
2) About not being compute bottlenecked:
I am a bit cautious about how well sparse autoencoder methods would scale to very high dimensionality. If the “scaling factor” estimated (with very low confidence) in the original work is correct, then compute could become an issue.
“Here we found very weak, tentative evidence that, for a model of size d_model = 256, the number of features in superposition was over 100,000. This is a large scaling factor, and it’s only a lower bound. If the estimated scaling factor is approximately correct (and, we emphasize, we’re not at all confident in that result yet) or if it gets larger, then this method of feature extraction is going to be very costly to scale to the largest models – possibly more costly than training the models themselves.”
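To make the worry concrete, a quick back-of-envelope calculation (my own numbers: it takes the quoted lower bound at face value and assumes the features-per-dimension ratio stays constant with width, which is exactly what is uncertain):

```python
# Illustrative arithmetic only, not a measurement.
ratio = 100_000 / 256                  # >= ~390 features per dimension (quoted lower bound)

d_model = 12_288                       # e.g. a GPT-3-sized residual stream (my assumption)
n_features = ratio * d_model           # ~4.8e6 dictionary features for one layer
sae_params = 2 * d_model * n_features  # encoder + decoder weights of the autoencoder

print(f"{n_features:.1e} features, {sae_params:.1e} SAE parameters per dictionary")
# -> 4.8e+06 features, 1.2e+11 SAE parameters, per layer, before counting the
#    activations needed to train it (and before the ratio possibly growing with d_model).
```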
However:
- we need more evidence of this (or maybe I have missed an important update about it!)
- maybe I’m asking too much of it: my concerns about scaling relate to being able to recover most of the superposed features; but improving our understanding, even if it is not complete, is already a victory.
Oh no, my idea was to use the top-sorted worst-reconstructed datapoints when re-initializing (or alternatively, the worst perplexity when run through the full model). Since we’ll likely be re-initializing many dead features at a time, this might pick up on the same feature multiple times.
Would you cluster & then sample uniformly from the worst-k-reconstructed clusters?
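Concretely, I’m imagining something like the sketch below, purely illustrative, with made-up names and k-means as a stand-in for whatever clustering you’d actually pick:

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def reinit_directions_from_worst_clusters(acts, recon, n_dead, k=64):
    """Cluster the worst-reconstructed datapoints, then sample roughly uniformly
    across clusters, so one over-represented failure mode doesn't get assigned
    to many dead features at once.

    acts, recon: (N, d_model) activations and their current SAE reconstructions.
    """
    err = (acts - recon).pow(2).sum(-1)
    worst = err.topk(min(8 * n_dead, err.numel())).indices  # pool of badly reconstructed points
    pool = (acts - recon)[worst]                             # residual directions

    k = min(k, pool.shape[0])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pool.cpu().numpy())
    labels = torch.as_tensor(labels)

    # Take at most ceil(n_dead / k) points per cluster, then truncate to n_dead
    # (small clusters may yield fewer; resample in practice if that matters).
    per_cluster = -(-n_dead // k)
    picks = []
    for c in labels.unique():
        members = (labels == c).nonzero(as_tuple=True)[0]
        picks.append(members[torch.randperm(members.numel())[:per_cluster]])
    picks = torch.cat(picks)[:n_dead]

    dirs = pool[picks]
    return dirs / dirs.norm(dim=-1, keepdim=True)
```

Sampling across clusters rather than just taking the top-n_dead errors is exactly to avoid the duplicate-feature issue above.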
2) Not being compute bottlenecked: I do assign decent probability that we will eventually be compute bottlenecked; my point is that the bottleneck I currently see is the number of people working on it. This means, for me personally, focusing on flashy, fun applications of sparse autoencoders.
[As a relative measure, we’re not so compute-bottlenecked that we can’t learn dictionaries for the smaller Pythia models.]