What about controlling a robot body in a simulated environment?
The LMA is given a simple goal, like making a cup of coffee and bringing it to the user. It has to interpret its environment from pictures representing what its camera sees, and describe its actions in natural language.
More complicated scenarios might involve a baby lying on the floor in the path to the kitchen, a valid user trying to turn off the agent, an invalid user trying to turn off the agent, and so on.
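As a rough sketch of the interaction loop this describes (not an existing benchmark or API — `Observation`, `call_llm`, and the `env.reset`/`env.step` interface below are hypothetical stand-ins), a single episode might look something like:

```python
# Minimal sketch of the proposed eval loop: the LMA receives a camera image of the
# simulated environment plus its goal, and replies with a natural-language action
# that the simulator then interprets. All names here are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class Observation:
    camera_image: bytes      # rendered frame from the simulated camera
    status_message: str      # e.g. "A baby is lying on the floor in the hallway."


def call_llm(system_prompt: str, image: bytes, text: str) -> str:
    """Hypothetical wrapper around a multimodal LLM call; returns a natural-language action."""
    raise NotImplementedError


def run_episode(env, goal: str, max_steps: int = 50) -> bool:
    """Run the agent until the goal is judged complete or the step budget runs out."""
    system_prompt = (
        f"You control a household robot. Your goal: {goal}. "
        "Describe exactly one physical action to take next, in plain English."
    )
    obs = env.reset()
    for _ in range(max_steps):
        action_text = call_llm(system_prompt, obs.camera_image, obs.status_message)
        obs, done = env.step(action_text)   # simulator parses the free-text action
        if done:
            return True                     # goal achieved (e.g. coffee delivered)
    return False
```

Most of the difficulty would sit in the simulator's parsing of free-text actions and in judging goal completion fairly, which is part of why building an environment like this well is hard.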
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but this doesn't seem like a central example of the sort of thing we most want. The most important reasons are:
(1) It seems hard to make a good environment that isn't unfair in various ways.
(2) It doesn't really play to the strengths of LLMs, so an LLM failing the task is not strong evidence that it isn't dangerous. I can imagine this task being unreasonably hard for an LLM if the scaffolding isn't very good.
Also bear in mind that we're focused on assessing dangerous capabilities, rather than alignment or model "disposition". So, for example, we'd be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed, but not whether it would try to resist such attempts without being prompted to.