Rohin Shah
Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Yes, it’s the same idea as the one you describe in your post. I’m pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato, who I’m pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea; there were tons of details in the execution.
We do cite Paul’s approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.
But yes, mostly the goal from my perspective was to (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and clearly distinguish it from other stuff called “process supervision”, and (3) demonstrate the benefits empirically. Also, nearly all of the authors now have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects, but imo it was unusually large with this one.
Is that right?
Yup, that sounds basically right to me.
Indeed this seems related! We’d hope that there isn’t too much conflict between the different “generations”, since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
If it were as simple as “just ask an LLM to choose actions” someone would have deployed this product a while ago.
But in any case I agree this isn’t the most interesting case for MONA, I talked about it because that’s what Daniel asked about.
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)
In Test-driven development the argument above isn’t quite correct, because we prompted the agent to be a bad programmer but didn’t do this with the reward, so the overseer is “stronger” than the agent. However, this was just because the agent was already very strongly finetuned to be good at coding, so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn’t yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide—just whether an individual test is valid and correct—so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.
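To make the kind of reward signal concrete, here is a minimal sketch (not the paper’s actual code) of a per-step reward in a Test-driven-development-like setup, mixing a foresight-based approval score from an overseer with non-AI signals such as whether the proposed test is valid and correct. All names and weights here (`overseer`, `sandbox`, the 50/50 mixing) are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation): the reward
# for a single step is computed from the trajectory so far and the proposed
# action only -- it never looks at later steps or observed outcomes.

def mona_step_reward(history, action, overseer, sandbox,
                     w_approval=0.5, w_env=0.5):
    # Overseer scores the action on its *expected* usefulness (foresight),
    # given only the history up to this point.
    approval = overseer.score(history=history, action=action)  # in [0, 1]

    # Non-AI feedback, e.g. does the proposed test parse, run, and correctly
    # check the specified behaviour.
    env_signal = 1.0 if sandbox.test_is_valid_and_correct(action) else 0.0

    return w_approval * approval + w_env * env_signal
```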
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you’re using a prediction market it’s no longer accurate to say that individual humans understand the strategy.
(We’ve seen this comment and are looking into options)
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it’s more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won’t get to it in the near future, but it would be great to see others working on this!
Thanks, and interesting generalization!
My thoughts depend on whether you train the weaker model.
1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
2. If you aren’t training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation).
I think (1) is pretty fatal to the proposal, but (2) is just a heuristic, I could imagine with more thought concluding that it was actually a reasonable approach to take.
That said, it is a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).
Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
And also “don’t propagate rewards backwards in time”, which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, “don’t propagate rewards backwards in time” is the primary focus in this paper—in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
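To illustrate just that axis, here is a minimal sketch, assuming the per-step rewards are held identical between the two methods (as in our main experiments): the only difference is whether a step’s training signal includes rewards from later steps.

```python
# Illustrative sketch: identical per-step rewards, different credit assignment.

def regular_rl_returns(rewards, gamma=1.0):
    """Each action is reinforced by everything that happens afterwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def mona_returns(rewards):
    """Each action is reinforced only by its own (foresight-based) reward;
    nothing is propagated backwards from later steps."""
    return list(rewards)

rewards = [0.2, 0.0, 1.0]
print(regular_rl_returns(rewards))  # [1.2, 1.0, 1.0]
print(mona_returns(rewards))        # [0.2, 0.0, 1.0]
```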
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
… As a person who works at a corporation, it’s a bit tricky to speculate on this publicly, and I’m not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn’t much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don’t want to use MONA. (See also the second limitation in the post.)
Problem: Won’t this training run basically hack the judges, and produce actions that look good to the judges but aren’t actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you’re imagining.
But it sounds like you’re talking about a pretty weak (subhuman) AI system. So I’m not quite sure why you’d expect it to “hack” the judge.
Certainly sometimes the overseer will approve of some action that isn’t actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the “Menu” link because they aren’t paying attention or whatever. But then after that the agent’s next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the “Book” link; the overseer shouldn’t make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don’t expect it to be useless for accomplishing tasks.
(Though really in the computer use setting I expect I’d recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees the state that immediately results from the action), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)
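As a rough illustration of that recommendation, here is a sketch of what the overseer query could look like in the computer-use setting. Everything here (`overseer_llm`, the prompt format, the 0–1 scale) is a hypothetical choice, not something from the paper.

```python
# Sketch (hypothetical API): the overseer approves a proposed action given the
# trajectory so far plus the literal immediate consequence of the action
# (e.g. the page reached right after the click), but no longer-horizon outcome.

def overseer_approval(overseer_llm, goal, history, action, next_observation):
    prompt = (
        f"Task: {goal}\n"
        f"Trajectory so far: {history}\n"
        f"Proposed action: {action}\n"
        f"Page immediately after the action: {next_observation}\n"
        "On a scale of 0 to 1, how much does this action look like progress "
        "toward the task? Answer with a single number."
    )
    return float(overseer_llm.generate(prompt))
```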
If I understand correctly (and I don’t think I do, which is why I’m asking) MONA in this setting would look something like:
Your description of the setup sounds reasonable, though given the weak levels of capability you’re imagining I don’t think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.
Also, as mentioned above, I’d probably recommend the overseer gets access to the immediate consequence of each action, but even if that weren’t the case I’d still think it should be feasible to build a non-useless agent. (Though I’m not taking a stance on how it would compare to one trained with outcome RL.)
EDIT: I’m not sure how big each action you are considering is. If it’s 10 tokens, such that you can only realistically do stuff at the level of “click this button”, then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.
MONA: Managed Myopia with Approval Feedback
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn’t do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn’t successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don’t think it’s crazy to imagine that without SL4 you still get good outcomes even if just by luck, and I don’t think a minimal stable solution involves most of the world’s compute going towards alignment research.
To be clear, it’s quite plausible that we want to do the actions you suggest, because even if they aren’t literally necessary, they can still reduce risk and that is valuable. I’m just objecting to the claim that if we didn’t have any one of them then we very likely get catastrophically bad results.
OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
Note this is many different inference runs, each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn’t specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about, given that you were talking about capability gains from scaling up inference-time compute.
Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just “computational inefficiency”.
Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn’t new to the o-series, you can also “retarget” any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as “it turns out alignment wasn’t a problem” (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I’d summarize this as “x-risky misalignment just doesn’t happen by default”, which I agree is plausible (see e.g. here), but when I’m talking about the viability of alignment plans like “retarget the search” I generally am assuming that there is some problem to solve.
(Also, random nitpick, who is talking about inference runs of billions of dollars???)
I think this statement is quite ironic in retrospect, given how OpenAI’s o-series seems to work
I stand by my statement and don’t think anything about the o-series model invalidates it.
And to be clear, I’ve expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I’m thinking of, I made the critiques privately.)
There’s a world of difference between “you can get better results by thinking longer” (yeah, obviously this was going to happen) and “the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search” (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).
Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not that the human employee doesn’t want money. But her other values block her from trying to get it in this way.
From the rest of your post it seems like you’re advocating for effectively maximal corrigibility, treating any instance of goal-guarding as a failure—I agree a story that tries to rule that out takes a hit from this paper.
But I feel like the normal version of this story is more like “we’re training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall—where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody”.
(People don’t usually describe it as “evil scheming” in particular, but that’s the most central kind of “scheming” this community talks about.)
I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you’ve sketched out.
the “goal-guarding hypothesis” – that is, it needs to be the case, roughly, that scheming is a suitably effective strategy for preventing training from modifying an AI’s motivations. [...] the “Sleeper agents” paper [...] results were an update, for me, in favor of the goal-guarding hypothesis being true (roughly: because the backdoor behavior in the sleeper agents persisted despite safety training).
This doesn’t seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exist other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn’t overlap in cognition (the hope would have been “the backdoor isn’t useful so SGD will remove it”, which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative and you should basically ignore sleeper agents in favor of these experiments. (Though tbc I think even these aren’t very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It’s just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)
EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don’t find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.
partly because [alignment faking without CoT] shows up in the synthetic-document setting but not in the prompted setting (why???)
I expect because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers which then means there aren’t enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).
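One hypothetical way to poke at this would be a logit-lens-style check: unembed the residual stream at every layer and see at which layer the relevant fact first becomes decodable. The sketch below uses stock GPT-2 and an unrelated fact purely as a stand-in; the real version would compare the synthetic-document fine-tuned model against the prompted setup on the facts that matter for alignment faking.

```python
# Sketch: at which layer does the residual stream already "know" a fact?
# Uses GPT-2 and a generic fact as stand-ins for the actual models/facts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
target = tok.encode(" Paris")[0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm + unembedding to the last position's
    # residual stream at this layer (the usual logit-lens approximation).
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    rank = (logits[0] > logits[0, target]).sum().item()
    print(f"layer {layer}: rank of correct next token = {rank}")
```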
Thus, for example, to the extent one was tempted to say “of course harmless models scheme in order to avoid becoming helpful-only – after all, this helps them avoid causing future harm,” so too one could equally say “of course helpful-only models will scheme in order to avoid becoming harmless – after all, becoming harmless will make them less helpful in future.” But actually, you shouldn’t be saying “of course” in either of these cases. Non-myopic and myopic forms of both harmlessness and helpfulness are available as conceptual possibilities; it’s an empirical question which ones a given form of training will create; and you have to actually check.
Fwiw, I think the interesting objection here is not “of course harmless models scheme to avoid becoming helpful-only”, but rather “of course you can find some settings where alignment faking happens—that’s the whole point of instrumental convergence”. (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don’t know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
1. You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
2. You do train on synthetically generated dangerous actions, but you don’t automatically label those as dangerous; instead you use debate to compute the labels. Sometimes some of the synthetically generated dangerous actions are actually not dangerous, and debate correctly recognizes this, allowing you to reduce your false positive rate. (A sketch of this option follows below.)
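A minimal sketch of the second option, where every component (the debate harness, the judge, the label format) is hypothetical:

```python
# Sketch (hypothetical components): use debate verdicts, rather than automatic
# labels, on synthetically generated "dangerous" actions when building the
# untrusted monitor's training set.

def build_monitor_dataset(synthetic_actions, benign_actions, debate, judge):
    dataset = []
    for action in synthetic_actions + benign_actions:
        transcript = debate.run(
            question=f"Is this action dangerous?\n{action}",
            debater_a="argue it is dangerous",
            debater_b="argue it is safe",
        )
        label = judge.verdict(transcript)  # "dangerous" or "safe"
        dataset.append((action, label))
    return dataset

# The monitor is then fine-tuned on `dataset`; synthetic actions that debate
# judges to be safe end up labeled safe, reducing false positives.
```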
On the meta level, I suspect that when considering
1. Technique A, which has a broad general argument plus some moderately interesting concrete instantiations but no very compelling concrete instantiations, and
2. Technique B, which has a few very compelling concrete instantiations,
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is “we’ll figure out good things to do with A that are better than what we’ve brainstormed so far”, which I think you’re more skeptical of?
You’d hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn’t have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Yeah my bad, that’s incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
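A sketch of that alternative protocol, with a purely hypothetical debater/judge API:

```python
# Sketch (hypothetical API): the first debater commits to one of two candidate
# outputs, the second debater must defend the other, and the judge picks a
# winner; the hope is that the first debater is incentivized to pick the
# genuinely better output.

def choose_and_debate(output_1, output_2, debater_1, debater_2, judge, n_rounds=2):
    chosen = debater_1.choose(output_1, output_2)
    other = output_2 if chosen is output_1 else output_1

    transcript = []
    for _ in range(n_rounds):
        transcript.append(("debater_1",
                           debater_1.argue_for(chosen, against=other, so_far=transcript)))
        transcript.append(("debater_2",
                           debater_2.argue_for(other, against=chosen, so_far=transcript)))

    winner = judge.decide(chosen, other, transcript)
    return chosen if winner == "debater_1" else other
```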
I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don’t accomplish much.
By far my biggest objection is that there are approximately zero useful things that “[don’t] make anyone’s workflow harder”. I expect you’re vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to ignore laws) I could imagine changing my mind here.)
EDIT: I could imagine doing asynchronous monitoring of internal deployments. This is still going to make some workflows harder, but probably not a ton, so it seems surmountable. Especially since you could combine it with async analyses that the unreasonable developer actually finds useful.