Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don’t know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn’t be taken as reflective of some big strategy. I’m guessing we’ll go back to hiring a mix of the two around mid-2025.
You can check out my career FAQ, as well as various other resources linked from there.
Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.
Neither of these really affects the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I’d do basically the same things I’m doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don’t know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research, but idk what the ratio is and it’s not something I can easily find out.
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We’ll leave it up until the later of those two (and probably somewhat beyond that, but that isn’t guaranteed). I’ve edited the post.
Is that right?
Yes, that’s broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That’s a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA comes at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
I think this will become much more likely once we actually start observing long-term optimization failures in prod.
Agreed, we’re not advocating for using MONA now (and say so in the paper).
Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)
If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization when you don’t use MONA, and also that if you actually use MONA, then it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize that’s happening). Later, when things are put into production, errors happen, but it’s chalked up to “well, it’s hard to anticipate everything”.
Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
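To make the hypothetical concrete, here is a purely illustrative sketch of the two reward signals in that coding-agent example. Every name and number below is invented for illustration; this is not the MONA paper’s implementation, just a toy contrast between outcome-based reward and per-step approval.

```python
# Illustrative only: contrasting outcome-based RL reward with MONA-style
# per-step approval for the hypothetical coding agent above.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    code: str
    skips_error_checks: bool  # property the outcome-based reward can't see

def outcome_reward(steps: List[Step]) -> float:
    """Ordinary RL: reward only the measurable end result (here, latency).
    Skipping error checks reduces latency, so it *increases* this reward;
    the bugs it causes only show up later, in production."""
    latency_penalty = 0.1 * sum(not s.skips_error_checks for s in steps)
    return 1.0 - latency_penalty

def mona_step_reward(step: Step) -> float:
    """MONA: each step gets a myopic reward from an overseer's (nonmyopic)
    approval of that step, with no credit flowing back from later outcomes.
    The overseer disapproves of dropping error checks on sight, so the agent
    is never rewarded for learning to do it."""
    return 0.0 if step.skips_error_checks else 1.0

careful = [Step("try:\n    parse(x)", False), Step("except ValueError:\n    retry()", False)]
hacky = [Step("parse(x)  # no error handling", True)]

print(outcome_reward(careful), outcome_reward(hacky))                            # hacky scores higher
print(sum(map(mona_step_reward, careful)), sum(map(mona_step_reward, hacky)))    # careful scores higher
```

Under the outcome reward the hacky trajectory scores higher; under the per-step approval the careful one does, which is the measurable quality difference described above.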
I meant “it’s obvious you should use MONA if you are seeing problems with long-term optimization”, which I believe is Fabien’s position (otherwise it would be “hard to find”).
Your reaction seems more like “it’s obvious MONA would prevent multi-step reward hacks”; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it’s a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with “that’s obvious” instead of “that obviously won’t perform well, why would anyone ever do it”. Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it’s mostly the first point, and to some extent the third point.
Got it, that makes more sense. (When you said “methods work on toy domains” I interpreted “work” as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
I think by far the biggest open question is “how do you provide the nonmyopic approval so that the model actually performs well”. I don’t think anyone has even attempted to tackle this so it’s hard to tell what you could learn about it, but I’d be surprised if there weren’t generalizable lessons to be learned.
I agree that there’s not much benefit in “methods work” if that is understood as “work on the algorithm / code that given data + rewards / approvals translates it into gradient updates”. I care a lot more about iterating on how to produce the data + rewards / approvals.
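For concreteness, here is a minimal, hypothetical sketch of what “producing the nonmyopic approval” could look like with an LLM overseer. The prompt wording, the 0–10 scale, and the `call_llm` parameter are all assumptions for illustration, not the setup used in the paper; the claim above is that most of the useful iteration would happen on this kind of definition, not on the RL algorithm itself.

```python
# Hypothetical sketch of producing a nonmyopic approval signal with an
# overseer LLM; names and prompt are illustrative assumptions.
from typing import Callable, List

def nonmyopic_approval(task: str, history: List[str], step: str,
                       call_llm: Callable[[str], str]) -> float:
    """Score a single proposed step in [0, 1] using an overseer LLM's judgment."""
    history_text = "\n".join(history)
    prompt = (
        f"Task: {task}\n"
        f"Steps taken so far:\n{history_text}\n"
        f"Proposed next step:\n{step}\n\n"
        "Rate from 0 to 10 how good this step looks for eventually solving "
        "the task safely. Judge the step on its own merits, including its "
        "long-term usefulness, rather than on outcomes you cannot see. "
        "Answer with a single number."
    )
    reply = call_llm(prompt)
    try:
        return max(0.0, min(10.0, float(reply.strip()))) / 10.0
    except ValueError:
        return 0.0  # an unparseable overseer reply earns no reward
```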
My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.
I’d weakly bet against this; I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multi-agent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks (e.g. rewardbench)” but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
EDIT: And tbc this is the kind of thing I mean by “improving average-case feedback quality”. I now feel like I don’t know what you mean by “feedback quality”.
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn’t have to grapple as hard with the challenge of defining the approval feedback as I’d expect in a realistic deployment. But it does impose an alignment tax, so there’s no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is “how big is the alignment tax”, and I agree we don’t know the answer to that yet and may not have enough understanding by the time it is relevant, but I don’t really see why one would think “nah it’ll only work in toy domains”.
I agree debate doesn’t work yet, though I think there’s a >50% chance we demonstrate decent results in some LLM domain (possibly a “toy” one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024.
It also seems like it would still be useful if the methods were used some time after the labs want to use it for production runs.
It’s wild to me that you’re into moonshots when your objection to existing proposals is roughly “there isn’t enough time for research to make them useful”. Are you expecting the moonshots to be useful immediately?
I don’t know of any existing work in this category, sorry. But e.g. one project would be “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either one of them alone.
I’m excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I’m sad that there’s so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is “alternatives to adversarial training”. (There’s also “new moonshots for aligning superintelligence” but I don’t expect much to come out of that, and “white-box estimation of rare misbehavior” could help if you are willing to put optimization pressure against it, but that isn’t described as a core desideratum, and I don’t expect we get it. Work on externalized reasoning can also make alignment easier, but I’m not counting that as “directly relevant”.) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)
Areas that I’m more excited about relative to the median area in this post (including some of your starred areas):
Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
Mild optimization. I’m particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.
(And tbc there are plenty of other areas directly relevant to alignment that I’m less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and “gradient descent psychology” (empirically studying how fine-tuning techniques affect LLM behavior).)
Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that’s an alignment stress test.)
(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system—all of the theory depends on equilibrium behavior which you only get via training.)
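To illustrate the distinction being drawn here, a hypothetical pseudocode sketch: a control evaluation holds training fixed and swaps in an adversarial model at inference time, whereas scalable oversight uses the protocol as a training signal. Every name below is invented for this sketch and does not come from any actual control or debate codebase.

```python
# Hypothetical sketch of "control evaluation" vs "scalable oversight as training".
from typing import Callable, List

Model = Callable[[str], str]  # a "model" maps a task to an action/answer

def control_evaluation(protocol: Callable[[str, Model], bool],
                       adversarial_model: Model,
                       tasks: List[str]) -> bool:
    # Control: keep the models' training fixed, replace the untrusted model
    # with an adversarial one at inference time, and check whether the
    # surrounding protocol (monitoring, auditing, ...) still blocks bad outcomes.
    return all(protocol(task, adversarial_model) for task in tasks)

def train_with_scalable_oversight(model: Model,
                                  judge: Callable[[str, str], float],
                                  update: Callable[[Model, float], Model],
                                  tasks: List[str]) -> Model:
    # Scalable oversight (e.g. debate): the protocol supplies the *training*
    # signal, aiming to end up with a model that is actually trustworthy.
    for task in tasks:
        model = update(model, judge(task, model(task)))
    return model
```

(The point above is that running the first procedure on the output of the second discards exactly the property the second is trying to establish.)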
I don’t think you should think of “poor info flows” as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over-indexing on my experience at Google, but I’d currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.
(When I said that there are approximately zero useful things that don’t make anyone’s workflow harder, I definitely had in mind things like “you have to bug other people to get the info you need”, it’s just such a background part of my model that I didn’t realize it was worth spelling out.)
In broad strokes I agree with Zac. And tbc I’m generally a fan of formal verification and did part of my PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to “don’t build such AIs”, in which case I would appreciate that being stated more directly, or if it reduces to “limit the use of such AIs to tasks where we can formally verify soundness and uniqueness”, in which case I’d like an estimate of what fraction of economically valuable work this corresponds to).
Can you sketch out how one produces a sound overapproximation of human psychology? Or how you construct a safety specification that the AIs won’t exploit human psychology?
I also agree with Zac: maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably wouldn’t accomplish much.
By far my biggest objection is that there are approximately zero useful things that “[don’t] make anyone’s workflow harder”. I expect you’re vastly underestimating the complexity of production systems and the companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though; depending on how much of a do-ocracy it is (e.g. willing to ignore laws), I could imagine changing my mind here.)
EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.
EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I’m on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they’re not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren’t the other researchers doing it?), so maybe I’d estimate 0.3%. But this isn’t super cruxy because I think you can do useful safety work with just 0.3% of the compute.
Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I’m imagining more like 1%), but I don’t see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.
Yes, it’s the same idea as the one you describe in your post. I’m pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I’m pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul’s approval-directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.
But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called “process supervision”, (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.
Is that right?
Yup, that sounds basically right to me.
The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don’t want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the “Any other info” question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren’t going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.