Thanks for the prediction. Perhaps I’m underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to more token-specific information (like part of speech) to increase. Obviously a bigram-only model doesn’t care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to activations from other contexts. If true, this would mean that as models scale up, you’d take a bigger efficiency hit by not shuffling when you could have (assuming fixed batch size).
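A minimal sketch of the measurement I have in mind, using synthetic activations rather than a real model (the shared/token-specific mixture here is entirely made up for illustration; with a real model you'd substitute residual-stream activations from actual contexts):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_context(n_tokens=64, d=256, shared_weight=0.5):
    """Toy stand-in for one context's activations: a context-wide shared
    component plus token-specific noise. shared_weight is a made-up knob,
    not something measured from any model."""
    shared = rng.normal(size=d)             # shared across tokens in context
    noise = rng.normal(size=(n_tokens, d))  # token-specific part
    return shared_weight * shared + (1 - shared_weight) * noise

def normalize(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def within_context_sim(c):
    """Mean cosine similarity over distinct token pairs in one context."""
    n = c.shape[0]
    sims = normalize(c) @ normalize(c).T
    return (sims.sum() - n) / (n * (n - 1))  # drop the diagonal self-pairs

def across_context_sim(a, b):
    """Mean cosine similarity between tokens of two different contexts."""
    return float((normalize(a) @ normalize(b).T).mean())

contexts = [make_context() for _ in range(8)]

within = np.mean([within_context_sim(c) for c in contexts])
across = np.mean([across_context_sim(contexts[i], contexts[j])
                  for i in range(len(contexts))
                  for j in range(len(contexts)) if i != j])

print(f"within-context mean cosine: {within:.3f}")
print(f"across-context mean cosine: {across:.3f}")
```

If the hypothesis is right, the within/across gap measured this way on real activations should widen with model scale.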