There has been work structuring batches like this! But as far as I know only with deliberately provided external memory, rather than trying to rely on the sort of innate-recent-recall Transformers might have.
Specifically, if you look at page 4 of Memorizing Transformers it pretty much has exactly your chart. Memorizing Transformers uses an approximation to KNN as a non-differentiable substitute for attention in a handful of layers, and gets a much, much longer effective context length with this.
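To make the KNN-instead-of-attention idea concrete, here is a rough toy sketch in NumPy. It is not the paper's implementation: it uses exact top-k search where the paper uses an approximation, and the function name, shapes, and `k` are all made up for illustration. The memory is just a stored pile of (key, value) pairs from earlier text, and the query attends only to its k best matches.

```python
import numpy as np

def knn_memory_attention(query, mem_keys, mem_values, k=4):
    """Toy sketch: attend over only the k memory entries that best match the query.

    Hypothetical helper for illustration; exact search stands in for
    the approximate KNN lookup a real system would use.
    """
    # Scaled dot-product score against every stored key: shape (n_mem,)
    scores = mem_keys @ query / np.sqrt(query.shape[0])
    # Indices of the k highest scores (unordered). This hard selection is
    # the non-differentiable step: no gradient flows through the choice.
    top = np.argpartition(scores, -k)[-k:]
    # Ordinary softmax attention, restricted to the retrieved entries.
    top_scores = scores[top]
    weights = np.exp(top_scores - top_scores.max())
    weights /= weights.sum()
    return weights @ mem_values[top], top

# Made-up memory: 1000 stored (key, value) pairs of width 64.
rng = np.random.default_rng(0)
mem_keys = rng.normal(size=(1000, 64))
mem_values = rng.normal(size=(1000, 64))

# A query that points in the same direction as stored key 42.
query = 3.0 * mem_keys[42]
out, idx = knn_memory_attention(query, mem_keys, mem_values, k=4)
```

The point is that the cost of the softmax depends on k rather than on how much is stored, which is how the memory can be far longer than the normal context window.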
This (or something like this) might be behind Anthropic’s 100k context length, particularly because you can add this to a pre-trained Transformer and have it just work. Or it might not; there are a bunch of ways to try to extend effective attention.
(I don’t thiiiink this would work very well without some kind of addition to transformer architecture, because I don’t think the training process in batch 2 will teach it how to access whatever was changed in the weights by batch 1.)
Thanks, that was very informative. I’ll be tinkering with it as I upskill on LLMs.