leogao comments on A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

leogao 19 Jul 2024 17:04 UTC
3 points
2
I’m 80% confident that, with optimal hyperparameters for both (you need to retune hparams when you change batch size), a batch size of 131072/64 is substantially less efficient than 131072.

We find that at a batch size of 131072, when hyperparameters are tuned, then the training curves as a function of number of tokens are roughly the same as with a batch size of 4096 (see appendix A.4). So it is not the case that 131072 is in a degenerate large batch regime where efficiency is substantially degraded by batch size.

When your batch is not fully iid, it behaves like a smaller batch of iid data (in the extreme, a batch containing 64 copies of the same data is obviously equivalent to a 64x smaller batch), but you still pay the compute cost of putting all 131072 tokens through the model.
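A rough way to put numbers on this is the standard design-effect approximation from survey statistics: if a batch of `n_tokens` is made of contexts with `tokens_per_context` tokens each, and per-token contributions within a context have correlation `rho`, the variance of the batch mean matches that of roughly `n_tokens / (1 + (tokens_per_context - 1) * rho)` iid samples. Treating per-token gradient contributions this way is an assumption for illustration, not something measured in this thread, and `effective_batch_size` and `rho` are hypothetical names.

```python
def effective_batch_size(n_tokens: int, tokens_per_context: int, rho: float) -> float:
    """Approximate number of iid samples a correlated batch is worth.

    Uses the design-effect formula 1 + (m - 1) * rho, where m is the
    number of tokens drawn from each context and rho is the assumed
    correlation between contributions from the same context.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    design_effect = 1.0 + (tokens_per_context - 1) * rho
    return n_tokens / design_effect


# Fully iid tokens: the whole batch counts.
iid_case = effective_batch_size(131072, 64, 0.0)

# 64 perfectly correlated tokens per context: a 64x smaller batch.
duplicate_case = effective_batch_size(131072, 64, 1.0)
```

With `rho = 1.0` this recovers the extreme case above (64 copies of the same data), and intermediate values of `rho` interpolate between the two regimes.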
- Dan Braun 20 Jul 2024 19:20 UTC
  1 point
  0
  Thanks for the prediction. Perhaps I’m underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect activations to carry more contextual information that is shared across tokens in the same context, relative to token-specific information like part of speech. Obviously a bigram-only model doesn’t care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities between activations within a context to those between activations from different contexts. If true, this would mean that as models scale up, you’d get a bigger efficiency hit if you didn’t shuffle when you could have (assuming fixed batch size).
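The cosine-similarity comparison suggested above could be sketched as follows. This is a minimal illustration, assuming activations have already been collected into an array of shape `(n_contexts, n_tokens, d_model)` with at least two contexts and two tokens per context; the function name `within_vs_across_cosine` and the array layout are hypothetical, and the synthetic data stands in for real model activations.

```python
import numpy as np


def within_vs_across_cosine(acts: np.ndarray) -> tuple[float, float]:
    """Mean cosine similarity for same-context vs different-context pairs.

    acts: array of shape (n_contexts, n_tokens, d_model).
    Assumes no activation vector is exactly zero.
    """
    n_ctx, n_tok, d_model = acts.shape
    # Normalise each activation vector to unit length.
    unit = acts / np.linalg.norm(acts, axis=-1, keepdims=True)
    flat = unit.reshape(n_ctx * n_tok, d_model)
    sims = flat @ flat.T  # all pairwise cosine similarities

    # Label each row with the context it came from.
    ctx_id = np.repeat(np.arange(n_ctx), n_tok)
    same_ctx = ctx_id[:, None] == ctx_id[None, :]
    not_self = ~np.eye(n_ctx * n_tok, dtype=bool)

    within = sims[same_ctx & not_self].mean()
    across = sims[~same_ctx].mean()
    return float(within), float(across)


# Synthetic stand-in: each context has a shared direction plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 1, 16))
noise = 0.1 * rng.normal(size=(8, 4, 16))
within, across = within_vs_across_cosine(shared + noise)
```

A large gap between `within` and `across` on real activations would suggest that tokens from the same context carry a lot of shared information, which is the regime where unshuffled batches would be expected to cost the most.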