You seem to be making a huge number of assumptions of what “scheming” means, so I am now mostly assuming you are talking about something else than what I am talking about, but I’ll try to clarify.
I agree there is miscommunication about what “scheming” means. I put out a post just yesterday about how people really should be more specific about what they mean by that.
I think Tom uses “scheming” to point at models that are very worst-case-like, in a way current models are not, whereas you think “there is no hard line between ‘role-playing’ and ‘scheming’.”
And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations. This requires a high degree of adversariality on the model’s part, which doesn’t fit with my understanding of “role-playing”, and which I take to be the reason why Tom makes “a huge number of assumptions” about scheming.
[1] There are complications from “even benign models sometimes fail and do weird stuff when (adversarially) evaluated”. These are not trivial, but I think we can incorporate that into our model.
And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations.
Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
And the model was caught in this example, so it’s not like we are assuming the AI systems were being super careful or conservative here.
Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made a huge deal out of it. Everyone involved would be used to AI systems acting against our interest to some degree all the time, and while it is scarier to see that applied to the act of escaping, it’s not that out-of-distribution from other behaviors we have seen plenty of times.
and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
If we are already at the stage “use another copy to monitor the actions”, then surely we’ll at least consider the failure mode “maybe the monitor just always gives thumbs-up” and take some measures against it.
So simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn’t strictly necessary; maybe you can get high evaluation accuracy via prompt engineering etc., though it helps a lot.)
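To make the "lot of Bayesian evidence" point a bit more concrete, here is a rough sketch of the arithmetic in odds form. The numbers are purely illustrative assumptions (they are not from the post): suppose only 1 in 1,000 of the buggy examples the monitor ever sees is an authentic attack rather than a synthetic test, and suppose the monitor only wants to stay silent when it is 99% sure the bug is authentic. The function names are hypothetical.

```python
import math


def required_likelihood_ratio(prior_real: float, target_posterior: float) -> float:
    """Likelihood ratio P(evidence | authentic) / P(evidence | synthetic)
    needed to move from prior_real to target_posterior (Bayes in odds form)."""
    prior_odds = prior_real / (1.0 - prior_real)
    target_odds = target_posterior / (1.0 - target_posterior)
    return target_odds / prior_odds


def posterior_real(prior_real: float, likelihood_ratio: float) -> float:
    """Posterior probability that the bug is authentic after seeing evidence
    with the given likelihood ratio."""
    odds = prior_real / (1.0 - prior_real) * likelihood_ratio
    return odds / (1.0 + odds)


# Illustrative assumptions only: 1 in 1,000 buggy inputs is authentic,
# and the monitor wants 99% confidence before staying silent.
ratio = required_likelihood_ratio(prior_real=0.001, target_posterior=0.99)
print(f"required likelihood ratio: {ratio:,.0f}")
print(f"equivalent bits of evidence: {math.log2(ratio):.1f}")
```

Under these made-up numbers the monitor would need cues that are on the order of 10^5 times more likely on authentic inputs than on synthetic ones. The better the synthetic examples mimic real ones, the harder it is to find cues that strong.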
(There’s some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that’s out?)
And simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic.
I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.
In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so during evaluations.
Hopefully, but I wouldn’t necessarily expect that we’ll have reasonably well designed control measures in place!