Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn’t have to grapple as hard with the challenge of defining the approval feedback as I’d expect in a realistic deployment. But it does impose an alignment tax, so there’s no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is “how big is the alignment tax”, and I agree we don’t know the answer to that yet and may not have enough understanding by the time it is relevant, but I don’t really see why one would think “nah it’ll only work in toy domains”.
I agree debate doesn’t work yet, though I think >50% chance we demonstrate decent results in some LLM domain (possibly a “toy” one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024.
It also seems like the methods would still be useful even if they only started being used some time after the labs first want to use them for production runs.
It’s wild to me that you’re into moonshots when your objection to existing proposals is roughly “there isn’t enough time for research to make them useful”. Are you expecting the moonshots to be useful immediately?
I don’t know of any existing work in this category, sorry. But e.g. one project would be “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either one of them alone.
I’m excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I’m sad that there’s so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is “alternatives to adversarial training”. (There’s also “new moonshots for aligning superintelligence” but I don’t expect much to come out of that, and “white-box estimation of rare misbehavior” could help if you are willing to put optimization pressure against it, but that isn’t described as a core desideratum, and I don’t expect we get it. Work on externalized reasoning can also make alignment easier, but I’m not counting that as “directly relevant”.) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)
Areas that I’m more excited about relative to the median area in this post (including some of your starred areas):
Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
Mild optimization. I’m particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.
(And tbc there are plenty of other areas directly relevant to alignment that I’m less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and “gradient descent psychology” (empirically studying how fine-tuning techniques affect LLM behavior).)
Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that’s an alignment stress test.)
(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system—all of the theory depends on equilibrium behavior which you only get via training.)
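To make the distinction concrete, here is a toy sketch (my own illustration, not from the RFP or any paper; all names are made up): a control evaluation substitutes an adversarial policy at inference time, while scalable oversight is a training-time process whose whole output is the trained policy that the substitution throws away.

```python
# Toy illustration of the distinction above; all names here are hypothetical.

def run_protocol(policy, num_steps=10):
    """Deploy a policy under some protocol and count the bad actions that get through."""
    return sum(1 for step in range(num_steps) if policy(step) == "bad")

def control_evaluation(adversarial_policy):
    # Control: substitute an adversarial model at *inference* time and measure
    # whether the surrounding protocol still prevents bad outcomes.
    return run_protocol(adversarial_policy)

def scalable_oversight_training(base_policy, overseer):
    # Scalable oversight: a *training-time* process that uses overseer feedback
    # to produce a (hopefully) trusted policy.
    return lambda step: "good" if overseer(step) else base_policy(step)

base = lambda step: "bad"      # untrusted starting point
overseer = lambda step: True   # stand-in for an amplified overseer
trained = scalable_oversight_training(base, overseer)

print(run_protocol(trained))     # 0: evaluates the policy that training produced
print(control_evaluation(base))  # 10: swaps in an adversary, ignoring what training produced
```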
I don’t think you should think of “poor info flows” as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over-indexing from my experience at Google, but I’d currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly disagree with it.
(When I said that there are approximately zero useful things that don’t make anyone’s workflow harder, I definitely had in mind things like “you have to bug other people to get the info you need”; it’s just such a background part of my model that I didn’t realize it was worth spelling out.)
In broad strokes I agree with Zac. And tbc I’m generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to “don’t build such AIs”, in which case I would appreciate that being stated more directly, or if it reduces to “limit the use of such AIs to tasks where we can formally verify soundness and uniqueness”, in which case I’d like an estimate of what fraction of economically valuable work this corresponds to).
Can you sketch out how one produces a sound overapproximation of human psychology? Or how you construct a safety specification that the AIs won’t exploit human psychology?
I also agree with Zac: maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably wouldn’t accomplish much.
By far my biggest objection is that there are approximately zero useful things that “[don’t] make anyone’s workflow harder”. I expect you’re vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to ignore laws) I could imagine changing my mind here.)
EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.
EDIT 2 (Feb 7): To be clear, I also disagree with the compute number. I’m on board with starting with 1% since they are 1% of the headcount. But then I would decrease it first because they’re not a high-compute capabilities team, and second because whatever they are doing should be less useful to the company than whatever the other researchers are doing (otherwise why weren’t the other researchers doing it?), so maybe I’d estimate 0.3%. But this isn’t super cruxy because I think you can do useful safety work with just 0.3% of the compute.
Again, I could imagine getting more compute with a well-selected group of 10 people (though even then 3% seems unlikely, I’m imagining more like 1%), but I don’t see why in this scenario you should assume you get a well-selected group, as opposed to 10 random AGI safety researchers.
Yes, it’s the same idea as the one you describe in your post. I’m pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I’m pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul’s approval-directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.
But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called “process supervision”, (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.
Is that right?
Yup, that sounds basically right to me.
Indeed this seems related! We’d hope that there isn’t too much conflict between the different “generations”, since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
If it were as simple as “just ask an LLM to choose actions” someone would have deployed this product a while ago.
But in any case I agree this isn’t the most interesting case for MONA, I talked about it because that’s what Daniel asked about.
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)
In Test-driven development the argument above isn’t quite correct, because we prompted the agent to be a bad programmer but didn’t do this with the reward, so the overseer is “stronger” than the agent. However this was just because the agent is already very strongly finetuned to be good at coding so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn’t yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide—just whether an individual test is valid and correct—so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you’re using a prediction market it’s no longer accurate to say that individual humans understand the strategy.
(We’ve seen this comment and are looking into options)
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it’s more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won’t get to it in the near future, but it would be great to see others working on this!
Thanks, and interesting generalization!
My thoughts depend on whether you train the weaker model.
1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
2. If you aren’t training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation).
I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.
That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).
Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
And also “don’t propagate rewards backwards in time”, which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, “don’t propagate rewards backwards in time” is the primary focus in this paper—in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
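As a rough way to put the distinction in symbols (my own notation, not the paper’s): write $r_t$ for the per-step feedback, which can include non-myopic approval. Then ordinary RL optimizes each action against the discounted return, while MONA optimizes it only against the immediate feedback:

$$\text{ordinary RL: } \max_\pi \; \mathbb{E}_\pi\Big[\textstyle\sum_{k \ge 0} \gamma^k \, r_{t+k}\Big] \qquad \text{MONA: } \max_\pi \; \mathbb{E}_\pi\big[r_t\big]$$

The feedback $r_t$ itself can be exactly the same in both cases; the only difference is whether rewards from later steps flow back into the update for the action at step $t$.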
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
… As a person who works at a corporation, it’s a bit tricky to speculate on this publicly, and I’m not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn’t much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don’t want to use MONA. (See also the second limitation in the post.)
Problem: Won’t this training run basically hack the judges, and produce actions that look good to the judges but aren’t actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you’re imagining.
But it sounds like you’re talking about a pretty weak (subhuman) AI system. So I’m not quite sure why you’d expect it to “hack” the judge.
Certainly sometimes the overseer will approve of some action that isn’t actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the “Menu” link because they aren’t paying attention or whatever. But then after that the agent’s next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the “Book” link; the overseer shouldn’t make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don’t expect it to be useless for accomplishing tasks.
(Though really in the computer use setting I expect I’d recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees the resulting observation), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)
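For concreteness, here is a minimal sketch of the loop I have in mind (my own illustration; the environment, policy, overseer, and update function are placeholders, and showing the overseer the next observation is the optional variant mentioned above):

```python
# Minimal sketch of per-step (MONA-style) approval for a computer-use agent.
# All names are illustrative placeholders, not an actual implementation.

def mona_episode(policy, env, overseer_approval, update_policy):
    history = []
    obs, done = env.reset(), False
    while not done:
        action = policy(history, obs)          # e.g. "click the 'Book' link"
        next_obs, done = env.step(action)
        # The overseer judges this single action, seeing the trajectory so far
        # (and optionally the immediate next observation), not the final outcome.
        reward = overseer_approval(history, obs, action, next_obs)
        # Key point: the update uses only this immediate approval; no reward from
        # later steps is propagated back to this action.
        update_policy(policy, history, obs, action, reward)
        history.append((obs, action))
        obs = next_obs
```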
If I understand correctly (and I don’t think I do, which is why I’m asking) MONA in this setting would look something like:
Your description of the setup sounds reasonable, though given the weak levels of capability you’re imagining I don’t think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.
Also, as mentioned above, I’d probably recommend the overseer gets access to the resulting observation, but even if that weren’t the case I’d still think it should be feasible to build a non-useless agent. (Though I’m not taking a stance on how it would compare to one trained with outcome RL.)
EDIT: I’m not sure how big each action you are considering is. If it’s 10 tokens, such that you can only realistically do stuff at the level of “click this button”, then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.
MONA: Managed Myopia with Approval Feedback
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn’t do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn’t successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don’t think it’s crazy to imagine that without SL4 you still get good outcomes (even if just by luck), and I don’t think a minimal stable solution involves most of the world’s compute going towards alignment research.
To be clear, it’s quite plausible that we want to do the actions you suggest, because even if they aren’t literally necessary, they can still reduce risk and that is valuable. I’m just objecting to the claim that if we didn’t have any one of them then we very likely get catastrophically bad results.
OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
Note this is many different inference runs, each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn’t specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you meant, given that you were talking about capability gains from scaling up inference-time compute.
Got it, that makes more sense. (When you said “methods work on toy domains” I interpreted “work” as a verb rather than a noun.)
I think by far the biggest open question is “how do you provide the nonmyopic approval so that the model actually performs well”. I don’t think anyone has even attempted to tackle this so it’s hard to tell what you could learn about it, but I’d be surprised if there weren’t generalizable lessons to be learned.
I agree that there’s not much benefit in “methods work” if that is understood as “work on the algorithm / code that given data + rewards / approvals translates it into gradient updates”. I care a lot more about iterating on how to produce the data + rewards / approvals.
I’d weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
EDIT: And tbc this is the kind of thing I mean by “improving average-case feedback quality”. I now feel like I don’t know what you mean by “feedback quality”.