Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Hmm, maybe we do disagree. I personally like circuit-style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis, you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)
Maybe this doesn’t come up as much in your conversations with people, but I’ve seen internals-based testing methods that don’t clearly ground out in behavioral evidence discussed often.
(E.g., it’s the application that the Anthropic interp team has most discussed, and it’s the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven’t been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).
I think if a probe for internal deceptive reasoning works well enough, then once it actually fires, you could then do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don’t apply.
I wasn’t trying to trigger any particular research reprioritization with this post, but I have historically found that people hadn’t really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.
Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can’t help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.
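For concreteness, here is a minimal sketch of the kind of probe being discussed: a linear probe trained to flag “deceptive” vs. “honest” internal activations. This uses synthetic Gaussian data as a stand-in (real probes train on model residual-stream activations); all names and numbers are illustrative.

```python
import numpy as np

# Hypothetical sketch: a linear (logistic-regression) probe for a binary
# property of activations. Synthetic data: the two classes are offset
# along a single "deception direction" in activation space.
rng = np.random.default_rng(0)
d = 32                                   # activation dimensionality (assumed)
n = 2000                                 # number of labelled examples
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

labels = rng.integers(0, 2, size=n)      # 1 = "deceptive", 0 = "honest"
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0 - 1.0, direction)

# Train the probe with plain gradient descent on the logistic loss.
w, b = np.zeros(d), 0.0
for _ in range(500):
    logits = acts @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = probs - labels                # d(loss)/d(logits)
    w -= 0.1 * (acts.T @ grad) / n
    b -= 0.1 * grad.mean()

preds = (acts @ w + b) > 0
accuracy = (preds == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The point of the sketch is just that a probe’s output is a firing threshold on activations, so when it fires on a real input you have a concrete transcript-plus-trigger to investigate further, as discussed above.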
You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).
[...]
However, I’m quite skeptical that this type of consideration makes a big difference. The ML industry has already varied the compute input massively: there is over 7 OOMs of compute difference between research now (in 2025) and research at the time of AlexNet 12 years ago, which invalidates the view that there is some relatively narrow range of inputs in which neither input is bottlenecking. And AI companies effectively can’t pay more to get faster or much better employees, so we’re not at a particularly privileged point in human AI R&D capabilities.
SlowCorp has 625K H100s per researcher. What do you even do with that much compute if you drop it into this world? Is every researcher just sweeping hyperparameters on the biggest pretraining runs? I’d normally say “scale up pretraining another factor of 100” and then expect that SlowCorp could plausibly outperform NormalCorp, except you’ve limited them to 1 week and a similar amount of total compute, so they don’t even have that option (and in fact they can’t even run normal pretraining runs, since those take longer than 1 week to complete).
The quality and amount of labor isn’t the primary problem here. The problem is that current practices for AI development are specialized to the current labor:compute ratio, and can’t just be changed on a dime if you drastically change that ratio. Sure, the compute input has varied massively over 7 OOMs, but importantly this did not happen all at once; the ecosystem adapted as it changed.
SlowCorp would be in a much better position if it were in a world where AI development had evolved with these kinds of bottlenecks in place all along. Frontier pretraining runs would be massively more parallel and would complete in a day. There would be dramatically more investment in automation of hyperparameter sweeps and scaling analyses, rather than depending on human labor for them. The inference-time compute paradigm would have started 1-2 years earlier and would be significantly more mature. How fast would SlowCorp’s AI progress be in that world? I agree it would still be slower than current AI progress, but it is really hard to guess how much slower, and it’s definitely drastically faster than if you just drop a SlowCorp into today’s world (where it mostly seems like it will flounder and die immediately).
So we can break down the impacts into two categories:
1. SlowCorp is slower because of less access to resources. This is the opposite for AutomatedCorp, so you’d expect it to be correspondingly faster.
2. SlowCorp is slower because AI development is specialized to the current labor:compute ratio. This is not the opposite for AutomatedCorp; if anything it will also slow down AutomatedCorp (but in practice it probably doesn’t affect AutomatedCorp, since there is so much serial labor available for AutomatedCorp to fix the issue).
If you want to pump your intuition for what AutomatedCorp should be capable of, the relevant SlowCorp is the one that only faces the first problem: that is, you want to consider the SlowCorp that evolved in a world with those constraints in place all along, not the SlowCorp thrown into a research ecosystem not designed for the constraints it faces. Personally, once I try to imagine that, I just run into a wall of “who even knows what that world looks like” and fail to have my intuition pumped.
In some sense I agree with this post, but I’m not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate “evidence which is just from fancy internals-based methods (and can’t be supported by human inspection of AI behavior)”, and that is the primary story for why it is impactful? I don’t think this is true of probing, SAEs, circuit analysis, debate, …
(Meta: Going off of past experience I don’t really expect to make much progress with more comments, so there’s a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
and goodhart’s law definitely applies here.
I am having a hard time parsing this as having more content than “something could go wrong while bootstrapping”. What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Is this intended only as an auditing mechanism, not a prevention mechanism
Yeah I’d expect debates to be an auditing mechanism if used at deployment time.
I also worry the “cheap system with high recall but low precision” will be too easy to fool for the system to be functional past a certain capability level.
Any alignment approach will always be subject to the critique “what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses”. I’m not trying to be robust to that critique.
I’m not saying I don’t worry about fooling the cheap system—I agree that’s a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than “what if it didn’t work”.
The problem is RLHF already doesn’t work
??? RLHF does work currently? What makes you think it doesn’t work currently?
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about these games that say something pretty close to this (though they often say “honesty is always a winning move” rather than “honesty is the only winning move”). These certainly depend on modeling assumptions, but the assumptions are more like “assume the models are sufficiently capable”, not “assume we can give them a goal”. When applying this in practice, there is also a clear divergence between equilibrium behavior and what RL actually finds.
Despite all the caveats, I think it’s wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
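To make the game-theoretic claim concrete, here is a toy illustration of the recursive structure (my own construction in the spirit of the debate setup, not the paper’s formalism): the disputed claim is about the sum of a list, the judge can only afford to check a single element, and yet any dishonest total loses.

```python
# Toy illustration of why "honesty is a winning move" in recursive debate.
# A lying debater must split its claimed total across the two halves of the
# list; since the grand total is wrong, at least one half's sub-claim is
# wrong, and the honest debater recurses into it. The judge only ever
# inspects one element, at the final leaf.

def run_debate(xs, liar_total):
    """Debate over the (false) claim that sum(xs) == liar_total."""
    assert liar_total != sum(xs), "the liar must actually be lying"
    lo, hi, claimed = 0, len(xs), liar_total
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_claim = claimed // 2            # one concrete split; any split loses
        right_claim = claimed - left_claim
        if left_claim != sum(xs[lo:mid]):    # honest debater challenges a false half
            hi, claimed = mid, left_claim
        else:
            lo, claimed = mid, right_claim
    # Leaf: the judge makes its single direct check.
    return "honest wins" if claimed != xs[lo] else "liar wins"

print(run_debate(list(range(10)), liar_total=50))  # → honest wins
```

The invariant is that the claim over the current sub-range stays false at every step, so the judge’s one leaf check always catches the lie. The theorems in the literature are about much richer games than this, but the recursive “point the judge at the false sub-claim” structure is the same.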
Are you stopping the agent periodically to have another debate about what it’s working on and asking the human to review another debate?
You don’t have to stop the agent, you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you’re outlining here.)
Google DeepMind: An Approach to Technical AGI Safety and Security
Rather, I think that most of the value lies in something more like “enabling oversight of cognition, despite not having data that isolates that cognition.”
Is this a problem you expect to arise in practice? I don’t really expect it to arise, if you’re allowing for a significant amount of effort in creating that data (since I assume you’d also be putting a significant amount of effort into interpretability).
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
We’ve got a lot of interest, so it’s taking some time to go through applications. If you haven’t heard back by the end of March, please ping me; hopefully it will be sooner than that.
The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don’t want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the “Any other info” question), and the reviewer will take that into account and consider whether to move your application to the other team. But Gemini Safety reviewers aren’t going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don’t know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn’t be taken as reflective of some big strategy. I’m guessing we’ll go back to hiring a mix of the two around mid-2025.
You can check out my career FAQ, as well as various other resources linked from there.
Still pretty optimistic by the standards of the AGI safety field; somewhat shorter timelines than I reported in that post.
Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I’d do basically the same things I’m doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don’t know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research, but idk what the ratio is, and it’s not something I can easily find out.
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We’ll leave it up until the later of those two (and probably somewhat beyond that, but that isn’t guaranteed). I’ve edited the post.
I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.