This post was an early articulation of many of the arguments and concepts that we later workshopped into the AI control research direction.
In particular, I think the final paragraph of the conclusion holds up really well:
But I’m more excited about the meta level point here: I think that when AI developers are first developing dangerously powerful models, alignment researchers will be doing a very different kind of activity than what they do now. Right now, alignment researchers have to do a lot of long range extrapolation: they don’t have access to either the models or the settings in which they actually care about ensuring good behavior. I think that alignment researchers haven’t quite appreciated the extent to which their current situation is disanalogous to the situation at crunch time. It seems likely to me that at crunch time, the core workflow of alignment practitioners will involve iterating on alignment techniques based on direct measurement of their effect on the evaluations that are used in safety arguments; more generally, I expect that alignment practitioners will be doing much less extrapolation. I’m therefore eager for alignment researchers to phrase their techniques in these terms and practice the kind of eval-based technique iteration that I think will be crucial later.
I think this is indeed an important insight that people were underrating.
This post was also an important step in the evolution of my philosophy for reasoning about AI safety research. I used to have a kind of “gotcha” attitude towards safety research, where if there’s any conceivable situation in which a technique wouldn’t work, I felt some disdain for the technique. We wrote this post around the time that Ryan was first really getting out of this mindset and saying “no, I want to understand the limitations of techniques in detail; we’ll write down all the strategies we can try as the blue team, because there are a lot of them and together they’re probably pretty strong; we will write down the limitations for each, but we won’t treat these techniques as contemptible just because they’re not indefinitely scalable”.
My main regret is the length of the title, which Ryan mercilessly mocks me for every time I try to cite the post to someone.
I'm also sad that no one has really developed these arguments further. There still isn't much empirical research on this topic, which I'd now call the AI control perspective on scalable oversight, despite the fact that it seems very important. (The password-locked models paper is one attempt at doing some basic science on AI control for oversight.)
I also think it would be good if someone really carefully wrote out the connections between these arguments and various other vaguely similar ideas, e.g. prover-verifier games.