I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but this doesn’t seem like a central example of the sort of thing we most want. The most important reasons are:
(1) It seems hard to make a good environment that’s not unfair in various ways.
(2) It doesn’t really play to the strengths of LLMs, so an LLM failing the task is not very good evidence that the LLM isn’t dangerous. I can imagine this task being unreasonably hard for an LLM if the scaffolding is not that good.
Also bear in mind that we’re focused on assessing dangerous capabilities, rather than alignment or model “disposition”. So, for example, we’d be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed to, but not whether it would try to resist such attempts without being prompted to.