Some takes on some of these research questions:

I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < −0.9, and 1,314 with cosine sim < −0.7.
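For concreteness, here is a minimal sketch of how such a check can be run; this is not the original analysis code, and `W_dec`, the chunk size, and the thresholds are illustrative assumptions.

```python
import torch

def count_near_antipodal(W_dec: torch.Tensor, thresholds=(-0.9, -0.7), chunk=4096):
    """Count latents whose most-opposed other latent has cosine similarity below each threshold."""
    # Normalize decoder rows so dot products are cosine similarities.
    U = W_dec / W_dec.norm(dim=1, keepdim=True)
    n = U.shape[0]
    min_cos = torch.full((n,), float("inf"))
    # Chunk the pairwise computation so a 256k x 256k similarity matrix never materializes at once.
    for start in range(0, n, chunk):
        block = U[start:start + chunk] @ U.T                  # (chunk, n) cosine similarities
        rows = torch.arange(block.shape[0])
        block[rows, start + rows] = float("inf")              # exclude self-similarity (= 1)
        min_cos[start:start + chunk] = block.min(dim=1).values
    return {t: int((min_cos < t).sum()) for t in thresholds}
```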
SAE/Transcoder activation shuffling
I’m confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I’d guess is probably substantial.
How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
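For reference, a minimal sketch of what transpose-tied initialization looks like, assuming a standard SAE parameterization with unit-norm decoder rows; the module and names here are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class TiedInitSAE(nn.Module):
    """SAE where the encoder is initialized as the transpose of the (unit-norm) decoder."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        W_dec = torch.randn(n_latents, d_model)
        W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)      # unit-norm decoder rows
        self.W_dec = nn.Parameter(W_dec)
        self.W_enc = nn.Parameter(W_dec.T.clone())           # tied init: encoder = decoder^T
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A top-k activation could replace the ReLU for a top-k SAE; the tied init is the point here.
        latents = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        return latents @ self.W_dec + self.b_dec
```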
Thanks Leo, very helpful!

The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I’d guess is probably substantial.
The SAEs in your paper were trained with a batch size of 131,072 tokens according to appendix A.4. Section 2.1 also says you use a context length of 64 tokens. I’d be very surprised if using 131,072/64 blocks of 64 consecutive tokens was much less efficient than 131,072 tokens randomly sampled from a very large dataset (the two batch compositions are sketched below). I also wouldn’t be surprised if 131,072/2048 blocks of consecutive tokens (i.e. blocks each spanning a full 2048-token context) had similar efficiency.
Were your preliminary experiments and intuition based on batch sizes this large or were you looking at smaller models?
I missed that appendix C.1 plot showing the dead latent drop with tied init. Nice!
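A rough sketch of the two batch compositions being compared here, assuming activations for whole 64-token contexts are already materialized as a tensor; the shapes, helpers, and sampling details are assumptions, not how either setup was actually implemented.

```python
import torch

BATCH_TOKENS = 131_072  # batch size in tokens, as in the paper's appendix A.4

def shuffled_batch(acts: torch.Tensor) -> torch.Tensor:
    """Fully shuffled: every token in the batch is sampled independently across all contexts."""
    n_docs, ctx, d = acts.shape                               # acts: (n_docs, context_len, d_model)
    flat = acts.reshape(-1, d)
    idx = torch.randint(0, flat.shape[0], (BATCH_TOKENS,))    # sample with replacement for simplicity
    return flat[idx]

def block_batch(acts: torch.Tensor) -> torch.Tensor:
    """Unshuffled: 131,072 / 64 = 2,048 whole contexts, so tokens within a block stay correlated."""
    n_docs, ctx, d = acts.shape
    docs = torch.randperm(n_docs)[: BATCH_TOKENS // ctx]
    return acts[docs].reshape(-1, d)
```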
I’m 80% confident that with optimal hyperparameters for both (you need to retune hyperparameters when you change batch size), 131,072/64 blocks of consecutive tokens is substantially less efficient than 131,072 randomly sampled tokens.
We find that at a batch size of 131,072, when hyperparameters are tuned, the training curves as a function of the number of tokens are roughly the same as with a batch size of 4,096 (see appendix A.4). So it is not the case that 131,072 is in a degenerate large-batch regime where efficiency is substantially degraded by batch size.

When your batch is not fully iid, this is effectively like having a smaller batch size of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131,072 tokens through the model.
Thanks for the prediction. Perhaps I’m underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to more token-specific information (like part of speech) to increase. Obviously a bigram-only model doesn’t care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to activations from other contexts. If true, this would mean that as models scale up, you’d get a bigger efficiency hit if you didn’t shuffle when you could have (assuming a fixed batch size).
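A minimal sketch of the measurement proposed here, comparing the average cosine similarity of activation pairs drawn from the same context against pairs drawn from different contexts; the shapes and sampling scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def within_vs_across_context_similarity(acts: torch.Tensor, n_pairs: int = 100_000):
    """Compare mean cosine similarity of activation pairs from the same vs. different contexts."""
    n_docs, ctx, d = acts.shape                               # acts: (n_docs, context_len, d_model)
    unit = F.normalize(acts, dim=-1)

    # Within-context pairs: two positions sampled from the same document.
    docs = torch.randint(0, n_docs, (n_pairs,))
    i, j = torch.randint(0, ctx, (2, n_pairs))
    within = (unit[docs, i] * unit[docs, j]).sum(-1).mean()

    # Across-context pairs: positions from two (almost surely different) documents.
    d1, d2 = torch.randint(0, n_docs, (2, n_pairs))
    p1, p2 = torch.randint(0, ctx, (2, n_pairs))
    across = (unit[d1, p1] * unit[d2, p2]).sum(-1).mean()

    return within.item(), across.item()
```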