Thank you for the kind comment! You have lots of good ideas for how to improve this. I especially like the idea of testing with different cloud providers. I could also vary the programming language: maybe GPT-4 is better at writing Node.js than Python (the language I prompted it to use).
I agree, a fully reproducible version would have benefits. Differences in prompt quality between evaluations are a problem.
Also agreed that it’s important to let the agent try to complete the tasks without assistance. I did that for this reproduction. The only changes I made to the agent’s commands were to restrict it to accessing files in a particular directory on my computer.
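For concreteness, here's a minimal sketch of that kind of directory restriction. The function name and directory path are illustrative, not my actual agent code; the real check would wrap whatever command executor the agent uses:

```python
from pathlib import Path

# Hypothetical allowed working directory for the agent.
ALLOWED_DIR = Path("/home/user/agent-workspace").resolve()

def is_path_allowed(candidate: str) -> bool:
    """Return True only if `candidate` resolves inside ALLOWED_DIR.

    Resolving first defeats traversal tricks like "../secrets".
    """
    resolved = Path(candidate).resolve()
    return resolved == ALLOWED_DIR or ALLOWED_DIR in resolved.parents

# Example checks:
print(is_path_allowed("/home/user/agent-workspace/notes.txt"))   # True
print(is_path_allowed("/etc/passwd"))                            # False
print(is_path_allowed("/home/user/agent-workspace/../secrets"))  # False
```

The key design point is resolving the path before comparing, so the agent can't escape the sandbox with `..` components or absolute paths.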
I’ve hesitated to open-source my code. I don’t want to accidentally advance the frontier of language model agents. But like I said in another comment, my code and prompts are pretty simple and don’t use any techniques that aren’t available elsewhere on the internet. So maybe it isn’t a big deal. Curious to hear what you think.
I wouldn’t recommend open-sourcing any state-of-the-art LLM agents. But if you open-source the evaluation, that would provide most of the benefits (letting labs evaluate their models on your benchmark, helping people build new benchmarks, allowing researchers to build safer agents that reject dangerous actions) while avoiding the capabilities externalities of open-sourcing a SOTA language agent.