This is cool! I’d be interested to see a fully reproducible version where anybody can download and run the evaluation. Right now, it’s hard to use this methodology to compare the abilities of different language models or scaffolding systems because the results are significantly driven by human prompting skills. To compare the abilities of different models and track progress towards language agents over time, you’d want to hold the evaluation constant and only vary the models and scaffolding structures. This might also be a more accurate analogy to models which are trying to survive and spread without human assistance along the way.
To turn this into a fully reproducible evaluation, you’d need to standardize the initial prompt and allow the agent to interact with the environment without additional prompting or assistance from a human operator. (A human overseer could still veto any dangerous actions.) The agent might get stuck at an intermediate step of the task, so you’d want to decompose each task into many individual steps and check whether the agent can pass each step individually, assuming previous steps were performed correctly. You could create many different scenarios to ensure that idiosyncratic factors aren’t dominating the results—for example, maybe the model is bad at AWS, but would perform much better with Azure or GCP. This benchmark runs this kind of automatic evaluation, but it doesn’t test many interesting tasks related to the survive-and-spread threat model.
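Concretely, the per-step scoring could look something like this rough sketch (all names here—`Step`, `run_agent`, and the task structure—are made-up placeholders, not from any existing benchmark):

```python
# Sketch of a step-decomposed agent evaluation: each step is scored
# independently, with the environment reset as if all prior steps succeeded.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    setup: Callable[[], None]   # put the environment in the state left by a correct run of all prior steps
    check: Callable[[], bool]   # did the agent complete this step?

def evaluate(task: list[Step], run_agent: Callable[[str], None]) -> dict[str, bool]:
    """Score each step of a task independently, assuming earlier steps succeeded."""
    results = {}
    for step in task:
        step.setup()            # reset to the post-previous-step state
        run_agent(step.name)    # agent attempts only this step, with no human help
        results[step.name] = step.check()
    return results
```

This way a single failure early on doesn’t zero out the whole task, and you get a per-step pass rate you can compare across models and scaffolds.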
Kudos for doing independent replication work; reproducibility is really important for allowing other people to build on existing work.
Thank you for the kind comment! You have lots of good ideas for how to improve this. I especially like the idea of testing with different cloud providers. I could also vary the programming language: maybe GPT-4 is better at writing Node.js than Python (the language I prompted it to use).
I agree, a fully reproducible version would have benefits. Differences in prompt quality between evaluations are a problem.
Also agreed that it’s important to allow the agent to try to complete the tasks without assistance. I did that for this reproduction. The only changes I made to the agent’s commands were to restrict it to accessing files in a particular directory on my computer.
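For what it’s worth, the kind of directory restriction I mean can be done with a small path check like this (a sketch, not my actual code—`is_allowed` and the sandbox path are illustrative):

```python
# Allow an agent's file operations only inside a designated sandbox directory.
from pathlib import Path

def is_allowed(requested: str, sandbox: str) -> bool:
    """Return True only if `requested` resolves inside the sandbox directory.

    resolve() follows symlinks and collapses '..', so escape attempts like
    'sandbox/../secrets.txt' are rejected before any command runs.
    """
    root = Path(sandbox).resolve()
    target = Path(requested).resolve()
    return target == root or root in target.parents
```

Every file path in a proposed command gets checked this way before the command is executed; anything outside the sandbox is refused.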
I’ve hesitated to open-source my code. I don’t want to accidentally advance the frontier of language model agents. But like I said in another comment, my code and prompts are pretty simple and don’t use any techniques that aren’t available elsewhere on the internet. So maybe it isn’t a big deal. Curious to hear what you think.
I wouldn’t recommend open-sourcing any state-of-the-art LLM agent. But if you open-source the evaluation, that would provide most of the benefits (letting labs evaluate their models on your benchmark, helping people build new benchmarks, allowing researchers to build safer agents which reject dangerous actions) while avoiding the capabilities externalities of open-sourcing a SOTA language agent.