Quarter-baked experiment:
Stick a sparse autoencoder on the residual stream in each block.
Share weights across the autoencoder instances at every block.
Train the autoencoder during model pretraining.
Allow the gradients from the autoencoder loss to flow into the rest of the model (a rough sketch of this setup follows below).
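To make the setup concrete, here is a minimal PyTorch sketch. The details are assumptions for illustration only: a tiny decoder-only stack with the attention sublayers omitted for brevity, an arbitrary dictionary size, and an arbitrary 1e-3 sparsity weight. None of these choices come from the experiment description; the point is just the wiring, i.e. one SAE instance shared across all blocks, its loss added to the LM loss, and its gradients allowed to reach the rest of the model.

```python
# Sketch only: module names, sizes, and the sparsity weight are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSAE(nn.Module):
    """One sparse autoencoder instance, shared across every block."""

    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def forward(self, resid: torch.Tensor):
        feats = F.relu(self.encoder(resid))  # non-negative feature activations
        recon = self.decoder(feats)          # reconstruction of the residual stream
        return recon, feats


class Block(nn.Module):
    """Stand-in transformer block (attention omitted to keep the sketch short)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class TinyTransformer(nn.Module):
    def __init__(self, vocab: int = 256, d_model: int = 64,
                 n_blocks: int = 4, dict_size: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_blocks)])
        self.sae = SharedSAE(d_model, dict_size)  # a single shared SAE instance
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor):
        x = self.embed(tokens)
        sae_loss = x.new_zeros(())
        for block in self.blocks:
            x = block(x)
            # The *same* SAE reads the residual stream after every block.
            # The residual is deliberately not detached, so the SAE loss
            # backpropagates into the embeddings and blocks, not just the SAE.
            recon, feats = self.sae(x)
            sae_loss = sae_loss + F.mse_loss(recon, x) + 1e-3 * feats.abs().mean()
        return self.unembed(x), sae_loss / len(self.blocks)


model = TinyTransformer()
tokens = torch.randint(0, 256, (2, 16))  # (batch, seq) of token ids
logits, sae_loss = model(tokens)
lm_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
(lm_loss + sae_loss).backward()  # joint objective: gradients reach the whole model
```

Setting the autoencoder term's weight to zero recovers an ordinary pretraining run, which is the natural baseline for the performance comparisons below.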
Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:
Do the meanings of features remain consistent across blocks? What does it mean for a feature in an earlier block to “mean” the same thing as that same dictionary feature in a later block, when the two blocks sit at different stages of the computation?
How much does a shared representation across all blocks harm performance? Getting the comparison right is subtle; it would be quite surprising if there were no slowdown in predictive training when it is combined with the autoencoder training, since the two objectives are not necessarily aligned. One option: train very small models to convergence and see whether they reach different loss plateaus.
If forcing a shared representation doesn’t harm performance, why not? In principle, blocks can execute different sorts of programs with different IO. If the residual stream can obey a format that works for every block without a performance penalty, that would suggest there were enough representational degrees of freedom left over (e.g. via superposition) to “waste” some when a block doesn’t need them. Or it would suggest that the shared “features” mean something completely different at different points in execution.
Compare the size of the dictionary required to achieve a particular specificity of feature between the shared autoencoder and a per-block autoencoder. How much larger does the shared autoencoder need to be? In the limit, it could simply be BlockCount times larger, with some piece of the residual stream acting as a block-index lookup. It would be a little surprising if there were effectively no sharing.
Compare post-trained per-block autoencoders against per-block autoencoders embedded in pretraining that allow gradients to flow into the rest of the model. Are there any interesting differences in representation, for example in dictionary size relative to feature specificity? In other words, does pretraining with the feature autoencoder attached encourage a more decodable native representation?
Take a look at the decoded features across blocks. Can you find a pattern in which features are relevant to which blocks? (This doesn’t technically require having a shared autoencoder, but having a single shared dictionary makes it easier to point out when two blocks are acting on the same feature, rather than doing an investigation, squinting, and saying “yeah, that sure looks similar.”) A rough analysis sketch follows below.
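As a starting point on the cross-block feature questions, here is a rough analysis sketch that assumes the TinyTransformer/SharedSAE toy from the sketch above. The activation threshold and the use of Jaccard overlap between per-block active-feature sets are illustrative choices, not anything the experiment description prescribes.

```python
# Sketch only: assumes `model` from the TinyTransformer example above.
import torch

model.eval()
resid_by_block = []


def record(_module, _inputs, output):
    # Forward hooks on each block capture the residual stream after that block.
    resid_by_block.append(output.detach())


hooks = [block.register_forward_hook(record) for block in model.blocks]
with torch.no_grad():
    model(torch.randint(0, 256, (8, 16)))
for h in hooks:
    h.remove()

# Which dictionary features fire at which blocks? (1e-6 threshold is arbitrary.)
active_by_block = []
for resid in resid_by_block:
    _, feats = model.sae(resid)                       # same shared dictionary everywhere
    fired = (feats > 1e-6).flatten(0, 1).any(dim=0)   # (dict_size,) bool
    active_by_block.append(fired)

# Jaccard overlap of the active-feature sets between each pair of blocks.
n = len(active_by_block)
overlap = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        inter = (active_by_block[i] & active_by_block[j]).sum().float()
        union = (active_by_block[i] | active_by_block[j]).sum().float()
        overlap[i, j] = inter / union.clamp(min=1)
print(overlap)
```

A mostly disjoint, low-overlap pattern would be evidence for the “BlockCount separate dictionaries selected by a block lookup” outcome, while substantial overlap would suggest genuinely shared features whose meanings can then be spot-checked block by block.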