There may not be any ‘clear’ technical obstruction, but it has failed badly in the past. ‘Add more parallelism’ (particularly hierarchically) is one of the most obvious ways to improve attention, and people have spent the past 5 years failing to come up with efficient attentions that do anything but move along a Pareto frontier from ‘fast but doesn’t work’ to ‘slow and works only as well as the original dense attention’. It’s just inherently difficult to know what tokens you will need across millions of tokens without input from all the other tokens (unless you are psychic), implying extensive computation of some sort, which makes things inherently serial and costs you latency, even if you are rich enough to spend compute like water. You’ll note that when Claude-2 was demoing the ultra-long attention windows, it too spent a minute or two churning. While the most effective improvements in long-range attention like Flash Attention or Ring Attention are just hyperoptimizing dense attention, which is inherently limited.
There isn’t any clear technical obstruction to getting this time down pretty small with more parallelism.
There may not be any ‘clear’ technical obstruction, but it has failed badly in the past. ‘Add more parallelism’ (particularly hierarchically) is one of the most obvious ways to improve attention, and people have spent the past 5 years failing to come up with efficient attentions that do anything but move along a Pareto frontier from ‘fast but doesn’t work’ to ‘slow and works only as well as the original dense attention’. It’s just inherently difficult to know what tokens you will need across millions of tokens without input from all the other tokens (unless you are psychic), implying extensive computation of some sort, which makes things inherently serial and costs you latency, even if you are rich enough to spend compute like water. You’ll note that when Claude-2 was demoing the ultra-long attention windows, it too spent a minute or two churning. While the most effective improvements in long-range attention like Flash Attention or Ring Attention are just hyperoptimizing dense attention, which is inherently limited.