(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format)
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out, because I disagree with it more strongly, is:
One natural approach to judge safety is bootstrapping: when aligning a new AI system, use the previous generation of AI (that has already been aligned) as the judge.
I don’t think we have any fully aligned models, and we won’t any time soon. I would expect bootstrapping to align a model at most as thoroughly as its predecessor was aligned (and probably less), and Goodhart’s law definitely applies here.
You don’t have to stop the agent, you can just do it afterwards.
Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems:
This approach is especially well-suited to monitoring deployed systems, due to the sheer scale required at deployment.
Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed that the AI is failing debates in the wild, so it’s time to roll back the release)? Or are we trusting the AI judges enough at this point that we don’t need to stop and involve a human? I also worry that the “cheap system with high recall but low precision” will be too easy to fool for the system to be functional past a certain capability level.
From a 20,000 ft perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it’ll work or not. If we had evidence that RLHF worked great, and its only flaw was that the AI was getting too complex for humans to give feedback on, then I would be more interested in working through all of the fiddly details of amplified oversight. The problem is that RLHF already doesn’t work, and I think it’ll only work less well as AI becomes smarter and more agentic.
This is definitely the crux so probably really the only point worth debating.
RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak, but it’s still pretty easy to jailbreak. Sure, RLHF makes the agent less likely to output text you don’t like, but I think agents will reliably overcome that obstacle: useful agents won’t just be outputting the most probable continuation, they’ll be searching through decision space and finding those unlikely continuations that score well on their task. I don’t think RLHF does anything remotely analogous to making the model care about whether it’s following your intent, or following a constitution, etc.
You’re definitely aware of the misalignment that still exists in our current RLHF’d models and have read the recent papers on alignment faking, etc., so I probably can’t make an argument you haven’t already heard.
Maybe the true crux is that you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don’t.