In the article, you measured the MONA setup against a pure RL agent.
I’m curious about measuring MONA against the overseer-AI picking the next move directly[1]: the overseer-AI probably won’t[2] reward hack more than the MONA setup, but it’s unclear to me whether it would also have worse performance.
I could imagine discovering that the myopic MONA agent converges on one of:

1. Picking whatever the overseer-AI would pick (since those actions would seem “obviously correct” to the overseer-AI and result in the most reward)
2. Picking actions that seem impressive but are useless (bad! we could get better performance, still with ~no reward hacking, by asking the overseer directly[1])
3. Picking clever useful actions that the overseer wouldn’t have thought of (win!)
I’d give the overseer-AI scaffolding to make it more comparable to the MONA setup, such as the ability to observe the consequences of its own actions and roll them back if they seem bad.
MONA is trying to use the overseer-AI to put an upper bound on how much reward hacking happens, if I understood correctly. If anyone runs this experiment, I’d still validate that assumption, since it seems easy (the paper already has a way to measure reward hacking).
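To make the proposed comparison concrete, here is a rough sketch of what I have in mind (everything here is hypothetical for illustration; `env`, `overseer`, `mona_agent`, `measure_hack`, and the `snapshot`/`restore` hooks are stand-ins, not anything from the paper). Both conditions run in the same environment with the same reward-hacking metric; the only difference is whether the trained myopic agent or the overseer-AI itself picks each move, with the overseer getting the rollback scaffolding mentioned above.

```python
# Hypothetical sketch of the two conditions to compare. All names are made up
# for illustration; none of this is the MONA paper's actual code.

def run_mona_episode(env, mona_agent, measure_hack):
    """MONA condition: the myopic agent (trained on per-step approval) acts."""
    obs, done = env.reset(), False
    ground_truth, hacks = 0.0, 0
    while not done:
        action = mona_agent.act(obs)
        obs, reward, done, info = env.step(action)
        ground_truth += reward       # ground-truth reward, for performance
        hacks += measure_hack(info)  # the paper's reward-hacking measurement
    return ground_truth, hacks

def run_overseer_episode(env, overseer, measure_hack, max_retries=3):
    """Baseline condition: the overseer-AI picks each move directly, with
    scaffolding to observe the consequences and roll them back if bad."""
    obs, done = env.reset(), False
    ground_truth, hacks = 0.0, 0
    while not done:
        for attempt in range(max_retries):
            action = overseer.propose_action(obs, attempt)
            snapshot = env.snapshot()  # assumed rollback hook
            next_obs, reward, done, info = env.step(action)
            if overseer.outcome_looks_ok(next_obs) or attempt == max_retries - 1:
                break                  # keep this step
            env.restore(snapshot)      # consequences look bad: undo and retry
            done = False
        obs = next_obs
        ground_truth += reward
        hacks += measure_hack(info)
    return ground_truth, hacks
```

Comparing ground-truth performance and hacking counts across the two conditions would then distinguish the cases above: if the overseer baseline matches MONA on performance, that points at my option (1) or (2); if MONA does better, that's option (3).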
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI development.)
In Test-driven development the argument above isn’t quite correct, because we prompted the agent to be a bad programmer but didn’t do the same when providing the reward, so the overseer is “stronger” than the agent. However, this was just because the agent is already very strongly finetuned to be good at coding, so there was no headroom to climb, and we wanted to demonstrate that MONA would improve things if there was headroom to climb. I would bet that if we had a powerful model that wasn’t yet finetuned strongly for coding, then we would once again see your option (3). The rewards are quite easy to provide (just whether an individual test is valid and correct), so I think a less capable model should be able to provide them, while still getting the benefits we see in the experiments we did run.
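For intuition, here is a toy sketch of the kind of per-step reward I mean for the test-driven development case. This is a simplified framing for this comment, not the setup we actually used; `sandbox` and `overseer` are stand-in names. The non-AI signals check that the code runs and passes the agent's tests, and the overseer only has to judge whether each individual test is valid and correct.

```python
# Toy illustration only: stand-in names, not the environment from the paper.

def per_step_reward(problem, tests, code, sandbox, overseer):
    """Myopic reward for a single step in a test-driven-development task,
    mixing non-AI feedback with overseer judgements of individual tests."""
    # Non-AI feedback: does the code run, and does it pass the agent's tests?
    runs = sandbox.runs(code)
    passes = runs and sandbox.passes(code, tests)

    # Overseer feedback: is each individual test valid and correct for the
    # stated problem? A local judgement that a weaker model can make.
    tests_ok = all(overseer.test_is_valid_and_correct(t, problem) for t in tests)

    return float(runs) + float(passes) + float(tests_ok)
```

The point is just that every signal depends only on the current step, so the overseer never has to evaluate long-horizon consequences.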