One can learn a lot from this paper. A couple of observations are as follows.
1. Two of its authors are also the authors of “The AI Scientist”, https://arxiv.org/abs/2408.06292.
These two papers are clearly part of Jeff Clune’s “AI-generating algorithms” paradigm, https://arxiv.org/abs/1905.10985 (currently 123 citations on Google Scholar, though a number of its derivative works have higher citation counts).
Safety concerns were raised in the referenced Twitter thread and are also discussed in the paper (Section 6, page 12). As usual, the question of whether to publicize such capability gains or to keep quiet about them is quite non-trivial, so one should expect differences of opinion here. The capability gains themselves are easy to reproduce: no GPUs are needed on the client side, since everything rests on the ability to run LLM inference via API.
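To make the “API calls only” point concrete, here is a minimal, illustrative sketch of an ADAS-style meta loop, assuming an OpenAI-compatible client. All function names, prompts, and the evaluation stub are mine rather than the paper’s; the point is only that the whole loop bottoms out in API calls:

```python
# Illustrative sketch (not the paper's code) of a meta process that asks a
# strong LLM to write candidate agents, which are then scored with a weaker
# LLM backend. Everything runs client-side via API calls only -- no local GPUs.
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def evaluate(agent_code: str, tasks: list[str], model: str) -> float:
    """Placeholder: execute the candidate agent on validation tasks with the
    given LLM backend and return its score (the real system runs the code)."""
    return 0.0  # stand-in score

def meta_search(domain: str, tasks: list[str], steps: int = 10):
    archive: list[tuple[str, float]] = []  # growing archive of (agent_code, score)
    for _ in range(steps):
        prompt = (
            f"Domain: {domain}\n"
            f"Previously discovered agents and their scores: {archive}\n"
            "Propose a new, improved agent as Python code."
        )
        # A strong "meta" model writes the next candidate agent
        # (per footnote [1] below, the paper used gpt-4o-2024-05-13 in this role)...
        candidate = chat(prompt, model="gpt-4o-2024-05-13")
        # ...while the candidate itself is evaluated with a weaker, cheaper backend.
        archive.append((candidate, evaluate(candidate, tasks, model="gpt-3.5-turbo")))
    return max(archive, key=lambda pair: pair[1])
```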
2. The pattern “train an agent with a weak LLM, then substitute a stronger LLM after training, and performance jumps” is very pronounced here.
In particular, see Section 4.3, page 9. They synthesized a few agents on one of the ARC datasets using GPT-3.5 as the underlying LLM[1], reaching 12-14% performance. Then they substituted GPT-4 and Claude 3.5 Sonnet, and performance jumped to 30-37% and 38-49% respectively, without any further adjustments[2].
One should expect further gains when better future LLMs are substituted here, again without any adjustment of the agents.
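A hedged sketch of what this substitution amounts to in practice: the toy agent below is mine (the paper’s agents are generated code), but such agents bottom out in API calls like this one, so swapping the backend is literally a one-string change:

```python
# Minimal sketch, assuming the OpenAI client; the "agent" is a toy stand-in.
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str, model: str) -> str:
    """All of the agent's reasoning goes through this one API call."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def solve_task(task: str, model: str) -> str:
    """Toy stand-in for a discovered agent: plan first, then answer."""
    plan = call_llm(f"Outline a step-by-step plan for this task:\n{task}", model)
    return call_llm(f"Task:\n{task}\nPlan:\n{plan}\nGive the final answer.", model)

# During agent search (Section 4.3), the underlying model is weak:
#   answer = solve_task(task, model="gpt-3.5-turbo")
# After the search, substitute a stronger model with no other changes:
#   answer = solve_task(task, model="gpt-4")
# (For Claude 3.5 Sonnet, one would swap in the Anthropic client analogously.)
```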
[1] The LLM used by the generated agents during training and initial evaluation. The meta process controlling the generation of agents used gpt-4o-2024-05-13.
[2] Those who want to look more closely at the generated agents will find the conversation in https://github.com/ShengranHu/ADAS/issues/4 helpful.