The only frustrating part about all this is that we’ve seen virtually nothing done with agents in the past year, despite every major lab from OpenAI to DeepMind to Anthropic to Baidu admitting that not only is it the next step but that they’re already training models to use them. We’ve seen very few agentic models released, most notably Devin in the spring, and even then that only got a very limited release (likely due to server costs, since every codemonkey worth their salt will want to use it, and fifty million of them accessing Devin at once would crash the thing).
That’s not true. What is true is that agentic releases are only moderately advertised (especially compared to LLMs), so they are underrepresented in the information space.
But there are plenty of them. Consider the systems listed on https://www.swebench.com/.
There are plenty of agents better than Devin on SWE-bench, and some of them are open source, so one can deploy them independently. The recent “AI scientist” work was done with the help of one of these open source systems (specifically, with Aider, which was the leader 3 months ago, but which has been surpassed by many others since then).
And there are agents which are better than those presented on that leaderboard, e.g. Cosine’s Genie, see https://cosine.sh/blog/genie-technical-report and also a related section in https://openai.com/index/gpt-4o-fine-tuning/.
If one considers the GAIA benchmark, https://arxiv.org/abs/2311.12983 and https://huggingface.co/spaces/gaia-benchmark/leaderboard, the three leaders are all agentic.
But what one does wonder about is whether there are much stronger agentic systems which remain undisclosed. Generally speaking, so far it seems that the main progress in agentic systems comes from clever algorithmic innovation rather than from brute-force training compute, so there is far more room for players of different sizes in this space.