For what it’s worth, if what you’re carefully not discussing actually is related to context length: quite a lot of people have put quite a lot of work into making context lengths longer and long contexts more efficient, by a wide variety of means. Some of these approaches looked very plausible when written up as a paper, but very few have had much actual effect so far. Generally the ones that have helped have not changed the actual structure of how attention works; they have mostly been caching mechanisms that make attention more efficient to implement on current hardware. Typically the effect on capabilities hasn’t been to make pretraining much more efficient (since pretraining tends to be done with fairly short context lengths, with longer context lengths added later by finetuning), but just to make inference cheaper, which is a rather smaller capabilities effect.
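(For concreteness, one common example of such a caching mechanism is the KV cache used during incremental decoding: the keys and values for already-seen tokens are stored and reused rather than recomputed at every step, which speeds up inference without changing what attention computes. The sketch below is purely illustrative; all names, shapes, and the random projection matrices are my own assumptions, not any particular library's API.)

```python
# Minimal sketch of KV caching during incremental decoding (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 64
rng = np.random.default_rng(0)
# Fixed random projections standing in for a trained attention head's weights.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

k_cache, v_cache = [], []  # grow by one entry per generated token

def attend(x_t):
    """One decoding step: project only the new token, and reuse the cached
    keys/values of all previous tokens instead of recomputing them."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)                       # (t, d_model)
    V = np.stack(v_cache)                       # (t, d_model)
    scores = softmax(q @ K.T / np.sqrt(d_model))
    return scores @ V                           # output for the new token only

for _ in range(5):                              # pretend to decode 5 tokens
    x_t = rng.standard_normal(d_model)
    out = attend(x_t)
print(out.shape)                                # (64,)
```

The point is that the attention math is unchanged; the cache just avoids redoing work that was already done on earlier steps, which is why this kind of trick mainly shows up as cheaper inference rather than better pretraining.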
Thanks! The specific thing I was thinking about most recently was indeed specifically about context length, and I appreciate the answer tailored to that, as it basically fully addresses my concerns in this specific case.
However, I also did mean to ask the question more generally. I kinda hoped the answers might also be helpful to others with similar questions (and to me, if I have another idea meeting the same criteria in the future), but maybe expecting other people with the same question to find the question and answers here wasn't super realistic.