And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so[1] during evaluations.
Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
And the model was caught in this example, so it’s not like we are assuming the AI systems were being super careful or conservative here.
Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made a huge deal out of it. Everyone involved would be used to AI systems acting against our interest to some degree all the time, and while it is scarier to see that applied to the act of escaping, it’s not that out-of-distribution from other behaviors we have seen plenty of times.
and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
If we are already at the stage “use another copy to monitor the actions”, then surely we’ll at least consider the failure mode “maybe the monitor just always gives thumbs-up” and take some measures against it.
So simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn’t strictly necessary—maybe you can get high evaluation accuracy via prompt engineering etc. - though it helps a lot.)
(There’s some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that’s out?)
And simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic.
I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.
Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
And the model was caught in this example, so it’s not like we are assuming the AI systems were being super careful or conservative here.
Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made a huge deal out of it. Everyone involved would be used to AI systems acting against our interest to some degree all the time, and while it is scarier to see that applied to the act of escaping, it’s not that out-of-distribution from other behaviors we have seen plenty of times.
If we are already at the stage “use another copy to monitor the actions”, then surely we’ll at least consider the failure mode “maybe the monitor just always gives thumbs-up” and take some measures against it.
So simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn’t strictly necessary—maybe you can get high evaluation accuracy via prompt engineering etc. - though it helps a lot.)
(There’s some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that’s out?)
I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.