And simply lying isn’t as easy as it sounds: the monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it does report bugs. Thus, in order to not point out a bug, it would need to obtain a lot of Bayesian evidence that the bug it sees is authentic rather than synthetic.
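To make the "a lot of Bayesian evidence" point concrete, here is a rough odds calculation (the 99% figure is purely an illustrative assumption on my part, not a number from the post): if the monitor's prior is that 99% of bugs it is shown are synthetic honeypots, then

$$\underbrace{\frac{P(\text{authentic}\mid e)}{P(\text{synthetic}\mid e)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(e\mid\text{authentic})}{P(e\mid\text{synthetic})}}_{\text{likelihood ratio}} \;\times\; \underbrace{\frac{P(\text{authentic})}{P(\text{synthetic})}}_{\text{prior odds}\,=\,1:99}$$

so merely reaching 50/50 confidence that the bug is real requires a likelihood ratio of about 99, i.e. roughly $\log_2 99 \approx 6.6$ bits of evidence distinguishing authentic bugs from the synthetic ones it was trained on.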
I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.