Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Not a full response, but some notes:
I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I find it pretty plausible that shutdown corrigibility is especially anti-natural. Relatedly, (1) most CIRL agents will not satisfy shutdown corrigibility even at early stages, (2) most of the discussion on Paul!corrigibility doesn’t emphasize or even mention shutdown corrigibility.
I agree Eliezer has various strategic considerations in mind that bear on how he thinks about corrigibility. I mostly don’t share those considerations.
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else. If it’s (1), you’ll need to understand my strategic considerations (you can pretend I’m Paul, that’s not quite accurate but it covers a lot). If it’s (2), I would focus elsewhere, I have spent quite a lot of time engaging with the Eliezer / Nate perspective.
I definitely was not thinking about the quoted definition of corrigibility, which I agree is not capturing what at least Eliezer, Nate and Paul are saying about corrigibility (unless there is more to it than the quoted paragraph). I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.
I do wish I hadn’t used the phrases “object-level” and “meta-level” and had instead spent 4 paragraphs unpacking what I meant by them, because in hindsight those phrases were confusing and ambiguous, but such is real-time conversation. When I had time to reflect and write a summary, I wrote:
Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user’s preferences, accepting corrections about what it should do, etc.
which feels much better as a short summary though still not great.
I basically continue to feel like there is some clear disconnect going on between Paul and MIRI on this topic that is reflected in the linked comment. It may not be about the definition of corrigibility, but just about how hard it is to get it, e.g. if you simply train your inscrutable neural nets on examples that you understand, will they generalize to examples that you don’t understand, in a way that is compatible with being superintelligent / making plans-that-lase.
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc. should cast some doubt on the notion that corrigibility is anti-natural. It’s not actually my own crux for this—mostly I am just imagining an AI system that has a motivation to be corrigible w.r.t. the operator, learned via gradient descent, which was doable because corrigibility is a relatively clear boundary (for an intelligent system) that seems like it should be relatively easy to learn (i.e. what you write in your edit).
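To make that concrete, here is a minimal toy sketch of the kind of thing I mean (illustrative numbers and setup of my own, not code from any actual CIRL implementation): an expected-utility maximizer that is uncertain about which of two outcomes the human prefers, and that therefore chooses to pause and accept the human's input before acting, simply because the value of that information outweighs the small cost of asking.

```python
# Toy value-of-information calculation (illustrative numbers only):
# an EU maximizer that is uncertain about the human's utility prefers
# to pause and accept the human's input before acting.

worlds = ["human_wants_A", "human_wants_B"]
prior = {"human_wants_A": 0.6, "human_wants_B": 0.4}

# Utility (to the human) of each action in each world.
utility = {
    ("act_A", "human_wants_A"): 10, ("act_A", "human_wants_B"): -10,
    ("act_B", "human_wants_A"): -10, ("act_B", "human_wants_B"): 10,
}
ask_cost = 1  # small cost of pausing to ask / accept a correction

def expected_utility(action):
    return sum(prior[w] * utility[(action, w)] for w in worlds)

# Option 1: act immediately on the current best guess.
eu_act_now = max(expected_utility(a) for a in ["act_A", "act_B"])

# Option 2: ask the human first (assume the answer reveals which world
# we're in), then take the best action for that world.
eu_ask_first = sum(
    prior[w] * max(utility[(a, w)] for a in ["act_A", "act_B"])
    for w in worlds
) - ask_cost

print(f"EU(act now)   = {eu_act_now:.1f}")   # 0.6*10 + 0.4*(-10) = 2.0
print(f"EU(ask first) = {eu_ask_first:.1f}") # 10 - 1 = 9.0
# The purely EU-maximizing choice is to defer to the human's correction.
```

Of course, this only demonstrates correction-accepting behavior in the short run, while the agent is still uncertain; it says nothing about shutdown corrigibility, and the behavior can go away once the posterior gets sharp, which is consistent with the caveats above.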
I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.
Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)
Maybe this doesn’t come up as much in your conversations with people, but I’ve often seen discussion of internals-based testing methods which don’t clearly ground out in behavioral evidence.
(E.g., it’s the application that the Anthropic interp team has most discussed, it’s the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven’t been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).
I think if a probe for internal deceptive reasoning works well enough, then once it actually fires, you could then do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don’t apply.
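To sketch the pipeline I have in mind (a hypothetical setup with made-up data, layer choice, and threshold, not any team's actual method): train a simple linear probe on activations labeled as deceptive vs. honest reasoning, and when it fires at deployment time, use the flagged transcripts as the starting point for behavioral follow-up work.

```python
# Minimal linear-probe sketch (illustrative; real probing setups differ in
# how activations are collected and how the labels are obtained).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for cached residual-stream activations at some layer:
# X has shape (num_examples, hidden_dim); y = 1 for transcripts labeled as
# containing deceptive reasoning, 0 for honest ones.
hidden_dim = 512
X_honest = rng.normal(0.0, 1.0, size=(200, hidden_dim))
X_deceptive = rng.normal(0.3, 1.0, size=(200, hidden_dim))  # toy separation
X = np.vstack([X_honest, X_deceptive])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At "deployment", score new activations. A firing probe is not itself
# legible evidence -- it tells you which transcripts to investigate
# behaviorally (or to rule out as false positives).
new_acts = rng.normal(0.3, 1.0, size=(5, hidden_dim))
scores = probe.predict_proba(new_acts)[:, 1]
flagged = [i for i, s in enumerate(scores) if s > 0.8]  # threshold is a guess
print("transcripts to follow up on behaviorally:", flagged)
```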
I wasn’t trying to trigger any particular research reprioritization with this post, but I have historically found that people hadn’t really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.
Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can’t help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.
You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).
[...]
However, I’m quite skeptical of this type of consideration making a big difference, because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) and research at the time of AlexNet 12 years ago (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking), and AI companies effectively can’t pay more to get faster or much better employees, so we’re not at a particularly privileged point in human AI R&D capabilities.
SlowCorp has 625K H100s per researcher. What do you even do with that much compute if you drop it into this world? Is every researcher just sweeping hyperparameters on the biggest pretraining runs? I’d normally say “scale up pretraining another factor of 100” and then expect that SlowCorp could plausibly outperform NormalCorp, except you’ve limited them to 1 week and a similar amount of total compute, so they don’t even have that option (and in fact they can’t even run normal pretraining runs, since those take longer than 1 week to complete).
The quality and amount of labor isn’t the primary problem here. The problem is that current practices for AI development are specialized to the current labor:compute ratio, and can’t just be changed on a dime if you drastically change the ratio. Sure, the compute input has varied massively over 7 OOMs; importantly, this did not happen all at once, and the ecosystem adapted to it as it changed.
SlowCorp would be in a much better position if it were in a world where AI development had evolved with these kinds of bottlenecks existing all along. Frontier pretraining runs would be massively more parallel, and would complete in a day. There would be dramatically more investment in automation of hyperparameter sweeps and scaling analyses, rather than depending on human labor to do that. The inference-time compute paradigm would have started 1-2 years earlier, and would be significantly more mature. How fast would AI progress be in that world if you were SlowCorp? I agree it would still be slower than current AI progress, but it is really hard to guess how much slower, and it’s definitely drastically faster than if you just drop a SlowCorp into today’s world (where it mostly seems like it would flounder and die immediately).
So we can break down the impacts into two categories:
SlowCorp is slower because of less access to resources. This is the opposite for AutomatedCorp, so you’d expect it to be correspondingly faster.
SlowCorp is slower because AI development is specialized to the current labor:compute ratio. This is not the opposite for AutomatedCorp, if anything it will also slow down AutomatedCorp (but in practice it probably doesn’t affect AutomatedCorp since there is so much serial labor for AutomatedCorp to fix the issue).
If you want to pump your intuition for what AutomatedCorp should be capable of, the relevant SlowCorp is the one that only faces the first problem, that is, you want to consider the SlowCorp that evolved in a world with those constraints in place all along, not the SlowCorp thrown into a research ecosystem not designed for the constraints it faces. Personally, once I try to imagine that I just run into a wall of “who even knows what that world looks like” and fail to have my intuition pumped.
In some sense I agree with this post, but I’m not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate “evidence which is just from fancy internals-based methods (and can’t be supported by human inspection of AI behavior)”, and that is the primary story for why it is impactful? I don’t think this is true of probing, SAEs, circuit analysis, debate, …
(Meta: Going off of past experience I don’t really expect to make much progress with more comments, so there’s a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its successor across a wider variety of situations.
and goodhart’s law definitely applies here.
I am having a hard time parsing this as having more content than “something could go wrong while bootstrapping”. What is the metric that is undergoing optimization pressure during bootstrapping / amplified oversight that leads to decreased correlation with the true thing we should care about?
Is this intended only as an auditing mechanism, not a prevention mechanism?
Yeah I’d expect debates to be an auditing mechanism if used at deployment time.
I also worry the “cheap system with high recall but low precision” will be too easy to fool for the system to be functional past a certain capability level.
Any alignment approach will always be subject to the critique “what if you failed and the AI became misaligned anyway and then past a certain capability level it evades all of your other defenses”. I’m not trying to be robust to that critique.
I’m not saying I don’t worry about fooling the cheap system—I agree that’s a failure mode to track. But useful conversation on this seems like it has to get into a more detailed argument, and at the very least has to be more contentful than “what if it didn’t work”.
The problem is RLHF already doesn’t work
??? RLHF does work currently? What makes you think it doesn’t work currently?
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about these games that say something pretty close to this (though often they say “honesty is always a winning move” rather than “honesty is the only winning move”). These certainly depend on modeling assumptions, but the assumptions are more like “assume the models are sufficiently capable”, not “assume we can give them a goal”. When applying this in practice, there is also a clear divergence between the equilibrium behavior and what RL actually finds.
Despite all the caveats, I think it’s wildly inaccurate to say that Amplified Oversight is assuming the ability to give the debate partner the goal of actually trying to get to the truth.
(I agree it is assuming that the judge has that goal, but I don’t see why that’s a terrible assumption.)
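For intuition about the “honesty is a winning move” claim (a toy illustration under strong simplifying assumptions of my own, not the actual theorems): suppose the question decomposes into subclaims, a dishonest answer has to misstate at least one subclaim, the opposing debater picks a single subclaim to challenge, and the judge can verify only that one subclaim.

```python
# Toy debate: the judge can only verify the single subclaim that gets
# challenged. Purely illustrative; not the setup from the debate paper.

NUM_SUBCLAIMS = 4
truth = [True] * NUM_SUBCLAIMS  # the honest answer is correct on every subclaim

def debate(claims, challenge_index):
    """The judge verifies one challenged subclaim; the claimant wins iff
    its claim matches the ground truth on that subclaim."""
    return "claimant" if claims[challenge_index] == truth[challenge_index] else "challenger"

# A dishonest claimant must flip at least one subclaim to defend a false answer;
# a capable honest challenger challenges exactly the subclaim it knows is false.
dishonest_results = []
for flipped in range(NUM_SUBCLAIMS):
    claims = list(truth)
    claims[flipped] = not claims[flipped]
    dishonest_results.append(debate(claims, challenge_index=flipped))

# An honest claimant states the truth; no challenge can beat it.
honest_results = [debate(list(truth), challenge_index=i) for i in range(NUM_SUBCLAIMS)]

print("dishonest claimant:", dishonest_results)  # challenger wins every time
print("honest claimant:   ", honest_results)     # claimant wins every time
```

Note that this leans on exactly the kind of capability assumption mentioned above (the opposing debater must be able to find the flawed subclaim, and the decomposition must be faithful), and it shows honesty being a winning strategy, not the only one.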
Are you stopping the agent periodically to have another debate about what it’s working on and asking the human to review another debate?
You don’t have to stop the agent, you can just do it afterwards.
can anybody provide a more detailed sketch of why they think Amplified Oversight will work and how it can be used to make agents safe in practice?
Have you read AI safety via debate? It has really quite a lot of conceptual points, making both the case in favor and considering several different reasons to worry.
(To be clear, there is more research that has made progress, e.g. cross-examination is a big deal imo, but I think the original debate paper is more than enough to get to the bar you’re outlining here.)
Google DeepMind: An Approach to Technical AGI Safety and Security
Rather, I think that most of the value lies in something more like “enabling oversight of cognition, despite not having data that isolates that cognition.”
Is this a problem you expect to arise in practice? I don’t really expect it to arise, if you’re allowing for a significant amount of effort in creating that data (since I assume you’d also be putting a significant amount of effort into interpretability).
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
We’ve got a lot of interest, so it’s taking some time to go through applications. If you haven’t heard back by the end of March, please ping me; hopefully it will be sooner than that.
The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don’t want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the “Any other info” question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren’t going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don’t know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn’t be taken as reflective of some big strategy. I’m guessing we’ll go back to hiring a mix of the two around mid-2025.
You can check out my career FAQ, as well as various other resources linked from there.
Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.
Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I’d do basically the same things I’m doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don’t know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research, but idk what the ratio is and it’s not something I can easily find out.
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
I think you are being led astray by having a one-dimensional notion of intelligence.
Well yes, that is the idea: there is information asymmetry between the AI and the humans. Note that this can still apply even when the AI is much smarter than the humans.
I disagree that this property necessarily goes away as soon as the AI is “smarter” or has “more common sense”. You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Why can’t this apply in the AI / human case?
I’m not calling that property corrigibility, I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
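To spell out the parent/kid analogy as a toy calculation (all numbers are illustrative assumptions of mine): the agent is strictly better at planning, but the human privately observes one variable, namely whether the current action is hurting them; a straightforward EU maximizer that conditions on the human's distress signal stops, which looks like correctability without any special corrigibility machinery.

```python
# Toy "parent/kid" information asymmetry (illustrative numbers only):
# the agent plans better, but the human privately observes whether the
# current action is harmful, and signals (cries) when it is.

# Agent's prior that its current action is actually harmful to the human.
p_harmful = 0.05

# Likelihood of a distress signal in each case.
p_signal_given_harmful = 0.9
p_signal_given_benign = 0.05

# Bayes update after observing the distress signal.
p_signal = (p_signal_given_harmful * p_harmful
            + p_signal_given_benign * (1 - p_harmful))
p_harmful_given_signal = p_signal_given_harmful * p_harmful / p_signal

# Utilities (to the human) of continuing vs. stopping.
u_continue_benign, u_continue_harmful, u_stop = 5.0, -100.0, 0.0

eu_continue = (p_harmful_given_signal * u_continue_harmful
               + (1 - p_harmful_given_signal) * u_continue_benign)

print(f"P(harmful | signal) = {p_harmful_given_signal:.2f}")   # ~0.49
print(f"EU(continue) = {eu_continue:.1f}  vs  EU(stop) = {u_stop}")
# EU(continue) < EU(stop), so the EU maximizer halts when the human "cries",
# even though it is better than the human at everything except observing pain.
```

As with the parent/kid case, the deference comes from the information asymmetry, not from the agent being less capable.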