Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?
If I understand correctly (and I don’t think I do, which is why I’m asking) MONA in this setting would look something like:
Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn’t get to see the actual consequences of the action; it just gets to e.g. do some CoT to speculate about whether the action is going to be good. So maybe the training run looks something like: 10T tokens spanning 1T ‘basic actions in the environment’, and then for each of those 1T ‘basic actions’ there’s 10,000 tokens of CoT where three models (for, against, and judge) debate how good the action is given that context. And then the reward for the basic agent is straightforwardly based on the evaluations the judge gives.
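To make that concrete, here’s a minimal sketch of the loop I’m imagining; the function names are hypothetical stand-ins, not anything from the MONA paper:

```python
# A minimal sketch of the setup described above (all names are hypothetical).
from typing import Callable, List

def mona_episode(
    agent_act: Callable[[str], str],           # proposes the next action given the transcript so far
    judge_score: Callable[[str, str], float],  # scores (context, action) via its own CoT; never sees outcomes
    apply_action: Callable[[str, str], str],   # environment step: returns the updated transcript
    initial_context: str,
    num_steps: int,
) -> List[float]:
    """Roll out one episode; each step's reward is the judge's approval of that step alone."""
    context = initial_context
    rewards = []
    for _ in range(num_steps):
        action = agent_act(context)
        # The reward is assigned from speculation about the action, before and
        # independently of its actual consequences in the environment.
        rewards.append(judge_score(context, action))
        context = apply_action(context, action)
    return rewards
```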
Problem: Won’t this training run basically hack the judges, and produce actions that look good to the judges but aren’t actually useful for accomplishing tasks in the real world? (Maybe the idea is that above a certain level of basic capability, that won’t be true? Also maybe we can do something like IDA where the judges are copies of the agent that get to think longer, and so as the agent improves, so do they?)
Adding onto Rohin’s comment, assuming you’re interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of “debate” on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we’re worried this would lead to them colluding.
Adding more details to this picture:
We should keep a human in the loop as we train the judge and debater models. You can imagine we give a human rater a problem, an intended solution, a partial CoT trace for the problem, and two possible next steps. The human has to choose which step is better.
We should tell the rater to prioritize not just progress towards the goal, but also understandability. If the agent uses one step to make an inferential leap that the human doesn’t understand (even if correct), this should be considered worse than taking many steps that each make sense and together reach the same conclusion.
The human’s task becomes more difficult the longer the CoT traces are, and is obviously intractable at the level of 10T tokens per trace. This is the major reason we need the debate system.
The debater AIs will go back and forth about which step is better until the human is convinced one way or the other. Examples of things the debaters could point out:
“This step sounds insightful, but it’s just rephrasing this sentence from 50K tokens earlier in the CoT.”
“This step seems irrelevant. Maybe the agent is using steganography?”
“Although this step is incorrect, it gets the agent closer to the final answer, because…”
We train the judge to imitate the human, and train the debater to convince the judge that their assigned CoT step is the best.
Finally, the approval model for MONA is trained on a score derived from the judge model.
This is all done in an online training setup where the human stays in the loop as the AI systems continue to improve (see the sketch below).
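Putting the pieces above together (all names hypothetical; this sketches the proposal in this comment, not the paper’s implementation):

```python
# Rough sketch of the human-in-the-loop debate -> judge -> approval-model pipeline.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Comparison:
    problem: str
    intended_solution: str
    partial_cot: str
    step_a: str
    step_b: str

def collect_human_label(
    comparison: Comparison,
    run_debate: Callable[[Comparison], str],         # debaters argue for step A vs. step B
    human_choice: Callable[[Comparison, str], str],  # human reads the debate, returns "A" or "B"
) -> Tuple[Comparison, str]:
    """One human-in-the-loop data point: which of two candidate next steps is better."""
    transcript = run_debate(comparison)
    return comparison, human_choice(comparison, transcript)

# Training targets (schematic):
#   judge:          supervised on (comparison, debate transcript) -> the human's choice
#   debaters:       trained to convince the judge that their assigned step is better
#   approval model: trained on a score derived from the judge, and used as the
#                   per-step MONA reward for the agent (collected online as the
#                   agent improves).
```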
This all depends on the assumption that the (human + AI debaters) system isn’t so hackable as to render its feedback useless. This assumption seems pretty safe for pre-superhuman agents and <1M token CoTs. Because the above training setup keeps a human in the loop, we can at least conclude that the agent won’t learn to take actions that are obviously-bad to humans, e.g. outputting gibberish tokens like “describing.\ + similarlyNow write oppositeley.]”
Maybe with an intelligent enough AI or a long enough CoT, the human rater becomes so confused that their feedback is completely decoupled from “accomplishing tasks in the real world.” It’s debatable at what point this would happen; however, I think it’s pretty clearly easier to hack a human’s approval for the end result of a weeks-long task than for a single step. If outcome-based supervision is the alternative, MONA seems like a big improvement.
Problem: Won’t this training run basically hack the judges, and produce actions that look good to the judges but aren’t actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind; my response differs pretty wildly depending on the kind of failure you’re imagining.
But it sounds like you’re talking about a pretty weak (subhuman) AI system. So I’m not quite sure why you’d expect it to “hack” the judge.
Certainly sometimes the overseer will approve of some action that isn’t actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the “Menu” link because they aren’t paying attention or whatever. But then after that the agent’s next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the “Book” link; the overseer shouldn’t make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don’t expect it to be useless for accomplishing tasks.
(Though really in the computer use setting I expect I’d recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees s_{i+1}), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)
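Concretely, that just means the approval signal also takes the immediate next state as an input; a minimal sketch with hypothetical names:

```python
# Hypothetical variant where the overseer also sees the literal immediate
# consequence of the action, but nothing further downstream.
from typing import Callable

def approval_with_next_state(
    overseer: Callable[[str, str, str], float],  # scores (state, action, next_state)
    state: str,
    action: str,
    next_state: str,
) -> float:
    # Rewards are still not propagated backwards from later steps; the overseer
    # just gets to catch immediately visible mistakes (like the wrong link above).
    return overseer(state, action, next_state)
```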
If I understand correctly (and I don’t think I do, which is why I’m asking) MONA in this setting would look something like:
Your description of the setup sounds reasonable, though given the weak levels of capability you’re imagining I don’t think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.
Also as mentioned above I’d probably recommend the overseer gets access to s_{i+1} but even if that weren’t the case I’d still think it should be feasible to build a non-useless agent. (Though I’m not taking a stance on how it would compare to one trained with outcome RL.)
EDIT: I’m not sure how big each action you are considering is. If it’s 10 tokens, such that you can only realistically do stuff at the level of “click this button”, then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.
Thanks, this is helpful. Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Is MONA basically “Let’s ONLY use process-based feedback, no outcome-based feedback?”
And also “don’t propagate rewards backwards in time”, which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, “don’t propagate rewards backwards in time” is the primary focus in this paper—in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).
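To spell out the distinction, here’s a simplified sketch of the two credit-assignment schemes (not the paper’s implementation, just the conceptual difference):

```python
# Schematic contrast: regular RL propagates rewards backwards in time, MONA does not.
from typing import List

def regular_rl_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Each step is credited with all discounted future reward: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

def mona_returns(approvals: List[float]) -> List[float]:
    """Each step is credited only with its own (foresight-based) approval; nothing flows back from later steps."""
    return list(approvals)

# A step that merely sets up a later payoff (e.g. a multi-step reward hack) gets
# credit under regular RL but not under MONA:
print(regular_rl_returns([0.0, 0.0, 1.0]))  # approx [0.98, 0.99, 1.0]
print(mona_returns([0.0, 0.0, 1.0]))        # [0.0, 0.0, 1.0]
```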
Another objection: If this works for capabilities, why haven’t the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
… As a person who works at a corporation, I find it a bit tricky to speculate on this publicly, and I’m not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.
Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn’t much serious reward hacking going on at all (let alone multi-step reward hacking), and so you probably don’t want to use MONA. (See also the second limitation in the post.)
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer.
Edit: I elaborated
Plus maybe let the overseer observe the result and say “oops” and roll back that action, if we can implement a rollback in this context.
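A minimal sketch of what I have in mind, with hypothetical names and assuming the environment supports undoing an action:

```python
# Minimal sketch of "ask a frozen overseer to pick actions, with rollback".
from typing import Callable, List

def overseer_driven_loop(
    propose_actions: Callable[[str], List[str]],     # candidate actions for the current state
    overseer_pick: Callable[[str, List[str]], str],  # frozen overseer chooses among candidates
    step: Callable[[str, str], str],                 # apply the action, return the new state
    overseer_ok: Callable[[str, str], bool],         # overseer inspects the result and can say "oops"
    state: str,
    max_steps: int,
) -> str:
    for _ in range(max_steps):
        action = overseer_pick(state, propose_actions(state))
        new_state = step(state, action)
        # Roll back (i.e. keep the old state) if the overseer dislikes what it sees.
        state = new_state if overseer_ok(state, new_state) else state
    return state
```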
If it were as simple as “just ask an LLM to choose actions” someone would have deployed this product a while ago.
But in any case I agree this isn’t the most interesting case for MONA, I talked about it because that’s what Daniel asked about.