I’m 80% confident that, with optimal hyperparameters for both (you need to retune hparams when you change batch size), a batch size of 131072/64 is substantially less efficient than 131072.
We find that at a batch size of 131072, with tuned hyperparameters, the training curves as a function of number of tokens are roughly the same as at a batch size of 4096 (see appendix A.4). So 131072 is not in a degenerate large-batch regime where efficiency is substantially degraded by batch size.
When your batch is not fully iid, it behaves like a smaller batch of iid data (in the extreme, if your batch contains 64 copies of the same data, this is obviously the same as a 64x smaller batch size), but you still pay the compute cost of putting all 131072 tokens through the model.
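A rough way to put numbers on this, as a sketch only: assume the batch splits into groups (say, contexts) of equal size, and that per-sample gradients within a group share one pairwise correlation `rho` while different groups are independent. Under that assumption the standard design-effect formula for the variance of a mean of correlated samples gives an effective iid batch size. The function name, `group_size`, and `rho` are illustrative and not from the paper, and real gradient correlations are unlikely to be this uniform.

```python
def effective_batch_size(batch_size: int, group_size: int, rho: float) -> float:
    """Effective number of iid samples in a batch of correlated samples.

    Assumes (hypothetically) that the batch consists of groups of
    `group_size` samples, that samples within a group have pairwise
    correlation `rho`, and that different groups are independent.
    The variance of the batch mean is then inflated by the factor
    1 + (group_size - 1) * rho relative to a fully iid batch.
    """
    design_effect = 1.0 + (group_size - 1) * rho
    return batch_size / design_effect


# Extreme case from the comment: 64 identical copies of each sample.
print(effective_batch_size(131072, group_size=64, rho=1.0))

# Fully iid batch: nothing is lost.
print(effective_batch_size(131072, group_size=64, rho=0.0))
```

With `rho=1.0` this reproduces the 64x reduction described above, and with `rho=0.0` it returns the full batch size; anything in between depends on how correlated the per-token gradients actually are.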
Thanks for the prediction. Perhaps I’m underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to token-specific information (such as part of speech) to increase. Obviously a bigram-only model doesn’t care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to cosine similarities of activations from different contexts. If true, this would mean that as models scale up, you’d get a bigger efficiency hit if you didn’t shuffle when you could have (assuming fixed batch size).
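Here is a minimal sketch of that cosine-similarity comparison. It assumes you have already collected activations into an array of shape `(n_contexts, n_tokens, d_model)`; the function name and the synthetic data are mine, purely for illustration, and nothing here has been run on a real model.

```python
import numpy as np


def within_vs_across_cosine(acts: np.ndarray) -> tuple[float, float]:
    """Mean cosine similarity of activation pairs within vs across contexts.

    acts: array of shape (n_contexts, n_tokens, d_model).
    Returns (within, across), where `within` averages over pairs of
    distinct tokens from the same context and `across` averages over
    pairs of tokens from different contexts.
    """
    n_ctx, n_tok, d_model = acts.shape
    flat = acts.reshape(n_ctx * n_tok, d_model)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = unit @ unit.T  # all pairwise cosine similarities

    # Context index of each row of `flat` (row-major reshape order).
    ctx_id = np.repeat(np.arange(n_ctx), n_tok)
    same_ctx = ctx_id[:, None] == ctx_id[None, :]
    not_self = ~np.eye(n_ctx * n_tok, dtype=bool)

    within = sims[same_ctx & not_self].mean()
    across = sims[~same_ctx].mean()
    return float(within), float(across)


# Synthetic stand-in for real activations: each context has one shared
# vector plus independent per-token noise of the same scale.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 1, 32))
token_specific = rng.normal(size=(8, 16, 32))
within, across = within_vs_across_cosine(shared + token_specific)
print(f"within-context: {within:.3f}, across-context: {across:.3f}")
```

A larger gap between the two numbers would suggest more shared contextual information per token. Comparing that gap across model sizes (at a fixed layer or relative depth) is one way to check the scaling guess, though which layer to measure is a judgment call.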