One non-obvious but very important skill that all LLM-based SWE agents currently lack is reliably knowing which subtasks of a task they have successfully solved and which they have not. I think https://www.answer.ai/posts/2025-01-08-devin.html is a good case in point.
We have certainly seen a lot of progress in driving down hallucinations at longer and longer contexts with model scaling; that progress probably made the charts above possible in the first place. However, recent research (e.g., the NoLiMa benchmark from last month, https://arxiv.org/html/2502.05167v1) demonstrates that effective context length falls far short of what is advertised. I assume it's not just my personal experience but common knowledge among practitioners that hallucinations get worse the more text you feed to an LLM.
If I’m not mistaken, even with all the optimizations and “efficient” transformer attempts, we are still stuck (since GPT-2 at least) with self-attention + KV cache[1], whose per-token inference cost scales linearly as long as you haven’t run out of memory and quadratically afterwards. Sure, MLA has just massively ramped up the context length at which the latter happens, but it’s not unlimited: you won’t be able to cache, say, one day of work (especially since DRAM capacity has not been scaling exponentially for years, https://semianalysis.substack.com/p/the-memory-wall).
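To make the scaling concrete, here is a toy single-head decode loop (all names and dimensions are mine, not from any real library): each step appends the new token’s key/value to the cache and attends over the whole cache, so step t costs O(t) and the cache itself grows linearly with context, until memory runs out.

```python
# Minimal sketch of KV-cached decoding, assuming a toy single attention head.
import numpy as np

d = 8                        # head dimension (toy)
rng = np.random.default_rng(0)

k_cache = np.empty((0, d))   # cached keys, one row per past token
v_cache = np.empty((0, d))   # cached values

def decode_step(q, k, v):
    """Append this token's K/V to the cache, then attend over the whole cache."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d)      # O(t) dot products at step t
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over all cached positions
    return weights @ v_cache               # weighted sum of cached values

for t in range(16):                        # 16 decode steps
    q, k, v = rng.standard_normal((3, d))
    out = decode_step(q, k, v)

# Cache memory grows linearly with context length:
assert k_cache.shape == (16, d)
```

Without the cache, each step would recompute keys and values for the entire prefix, which is where the quadratic regime comes from once caching is no longer affordable.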
People will certainly come up with ways to optimize long-context performance further, but it doesn’t have to continue scaling the way it has since 2019.
- ^
Originally known as “past cache”, after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019 (see commit ffd6238). AFAIK the invention has not been described in the literature, and it’s entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this.
Aren’t you supposed, as a reviewer, to first give the authors a chance to write a rebuttal and discuss it with them before making your criticism public?