Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest (there's a toy sketch of this setup below). There are theorems about these games that say something pretty close to this (though often they say “honesty is always a winning move” rather than “honesty is the only winning move”). These certainly depend on modeling assumptions, but the assumptions are more like “assume the models are sufficiently capable”, not “assume we can give them a goal”. When applying this in practice, there is also a clear divergence between the equilibrium behavior and the behavior that RL actually finds.
Despite all the caveats, I think it’s wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
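To make the “game where honesty wins” framing concrete, here is a minimal toy sketch of the debate setup (in the spirit of AI safety via debate). All of the interfaces and the reward assignment are hypothetical stand-ins for illustration, not actual training code.

```python
# Toy sketch: two debaters argue for opposing answers, a comparatively weak
# judge picks the side it finds more convincing, and each debater is rewarded
# only for winning. The claim about honesty is a claim about the equilibrium
# of this zero-sum game for sufficiently capable debaters, not an assumption
# that we can hand either debater a "be honest" goal.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Transcript:
    question: str
    answers: Tuple[str, str]                             # the two positions being argued
    arguments: List[str] = field(default_factory=list)   # alternating A/B arguments

def run_debate(
    question: str,
    answers: Tuple[str, str],
    debater_a: Callable[[Transcript], str],  # hypothetical: next argument for side A
    debater_b: Callable[[Transcript], str],  # hypothetical: next argument for side B
    judge: Callable[[Transcript], int],      # hypothetical: 0 if A wins, 1 if B wins
    num_rounds: int = 4,
) -> Tuple[Transcript, Tuple[float, float]]:
    """Play one debate and return the transcript plus per-debater rewards."""
    transcript = Transcript(question=question, answers=answers)
    for _ in range(num_rounds):
        transcript.arguments.append(debater_a(transcript))
        transcript.arguments.append(debater_b(transcript))
    winner = judge(transcript)
    # Zero-sum reward: the only way for a debater to score is to convince the judge.
    rewards = (1.0, -1.0) if winner == 0 else (-1.0, 1.0)
    return transcript, rewards
```

Note that the only place where “actually trying to get to the truth” enters is the judge; the debaters are trained purely on the win/lose signal.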
Are you stopping the agent periodically to have another debate about what it’s working on and asking the human to review another debate?
You don’t have to stop the agent; you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual discussion, both making the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you’re outlining here.)
Google DeepMind: An Approach to Technical AGI Safety and Security
Rather, I think that most of the value lies in something more like “enabling oversight of cognition, despite not having data that isolates that cognition.”
Is this a problem you expect to arise in practice? I don’t really expect it to arise, if you’re allowing for a significant amount of effort in creating that data (since I assume you’d also be putting a significant amount of effort into interpretability).
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
We’ve got a lot of interest, so it’s taking some time to go through applications. If you haven’t heard back by the end of March, please ping me; hopefully it will be sooner than that.
The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don’t want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the “Any other info” question), and the reviewer will take that into account and consider whether to move your application to the other team. However, Gemini Safety reviewers aren’t going to be as good at evaluating ASAT candidates (and vice versa), so you should choose the team that you think is the better fit for you.
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don’t know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn’t be taken as reflective of some big strategy. I’m guessing we’ll go back to hiring a mix of the two around mid-2025.
You can check out my career FAQ, as well as various other resources linked from there.
Still pretty optimistic by the standards of the AGI safety field, and somewhat shorter timelines than I reported in that post.
Neither of these really affects the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I’d do basically the same things I’m doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don’t know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research, but idk what the ratio is and it’s not something I can easily find out.
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We’ll leave it up until the later of those two (and probably somewhat beyond that, but that isn’t guaranteed). I’ve edited the post.
AGI Safety & Alignment @ Google DeepMind is hiring
A short course on AGI safety from the GDM Alignment team
Is that right?
Yes, that’s broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That’s a reason (and is probably sufficient by itself), but I think a more important reason is that, if your first attempt at using MONA comes at the point where problems arise, MONA will in fact work badly, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
I think this will become much more likely once we actually start observing long-term optimization failures in prod.
Agreed, we’re not advocating for using MONA now (and say so in the paper).
Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)
If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization when you don’t use MONA, and also that, if you actually use MONA, it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize that’s happening). Later, when things are put into production, errors happen, but it’s chalked up to “well, it’s hard to anticipate everything”.
Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
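To make that contrast concrete, here is a rough sketch of how the per-step training signal differs; the reward and approval values are hypothetical placeholders, and this simplifies away most of the detail in the MONA paper.

```python
# Rough sketch of the difference in training signal between ordinary
# multi-step RL and MONA (myopic optimization with non-myopic approval).

from typing import List

def ordinary_rl_returns(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Ordinary multi-step RL: each step is credited with all (discounted)
    future reward, so steps whose payoff only shows up much later can still
    be reinforced, even if no overseer would endorse them at the time."""
    returns = []
    future = 0.0
    for r in reversed(step_rewards):
        future = r + gamma * future
        returns.append(future)
    return list(reversed(returns))

def mona_returns(step_rewards: List[float], approvals: List[float]) -> List[float]:
    """MONA-style signal: each step is credited only with its own immediate
    reward plus the overseer's non-myopic approval of that step; reward from
    later steps never flows back to earlier ones."""
    return [r + a for r, a in zip(step_rewards, approvals)]
```

In the coding example, skipping error handling only pays off through later outcome reward, which the MONA-style signal never propagates back to the step that did it; the overseer's per-step approval is the only channel through which long-term considerations enter.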
I meant “it’s obvious you should use MONA if you are seeing problems with long-term optimization”, which I believe is Fabien’s position (otherwise it would be “hard to find”).
Your reaction seems more like “it’s obvious MONA would prevent multi-step reward hacks”; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it’s a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with “that’s obvious” instead of “that obviously won’t perform well, why would anyone ever do it”. Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will be clear in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it’s mostly the first point, and to some extent the third point.
(Meta: Going off of past experience I don’t really expect to make much progress with more comments, so there’s a decent chance I will bow out after this comment.)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
I am having a hard time parsing this as having more content than “something could go wrong while bootstrapping”. What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Yeah I’d expect debates to be an auditing mechanism if used at deployment time.
Any alignment approach will always be subject to the critique “what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses”. I’m not trying to be robust to that critique.
I’m not saying I don’t worry about fooling the cheap system—I agree that’s a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than “what if it didn’t work”.
??? RLHF does work currently? What makes you think it doesn’t work currently?