I think I mostly understand what you’re saying. Essentially, MONA training is:
1. strong model comes up with a step
2. weak model outputs a score
3. the step gets executed
4. repeat, starting from the new state
Your idea adds “1.5. weak model optionally asks the strong model follow-up questions” and “2.5. weak model optionally changes the step.”
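To make sure we’re picturing the same loop, here is a rough sketch of how I’d write it down, with your optional 1.5/2.5 steps included (all of the interfaces below are invented for illustration, not something either of us has specified):

```python
# Rough sketch of the loop as I understand it; every interface here
# (propose, ask, score, env, etc.) is hypothetical.

def run_episode(strong_model, weak_model, env, allow_dialogue=True):
    state = env.reset()
    while not env.done:
        step = strong_model.propose(state)                 # 1. strong model comes up with a step
        answers = None
        if allow_dialogue:                                 # 1.5 weak model optionally asks follow-up questions
            questions = weak_model.ask(state, step)
            answers = strong_model.answer(state, step, questions)
        score = weak_model.score(state, step, answers)     # 2. weak model outputs a score,
        strong_model.record_reward(state, step, score)     #    which is the strong model's (myopic) RL reward
        if allow_dialogue:                                 # 2.5 weak model optionally changes the step
            step = weak_model.maybe_revise(state, step, answers)
        state = env.execute(step)                          # 3. the step gets executed
    return state                                           # 4. repeat, starting from the new state
```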
If you don’t update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model’s questions are all preemptively answered, and which anticipates the weak model’s preferences enough that the weak model never feels the need to change the step. So this is essentially equivalent to MONA.
However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize reward, and the weak model does exactly the actions that the strong model suggests. (Technically the weak model isn’t even incentivized to score the strong model any particular way, but I’m just assuming it gives high scores for suggestions that it decides to carry out.)
I guess if you only train the weak model a small amount, it will do something in the middle, where it’s kind of motivated by long-term reward and kind of motivated by its initial values. There’s no guarantee on exactly what it does. I think the stipulation that the weak model remains safe “due to its architecture or whatever reason we trust it more” is doing a lot of work here; I’m not sure exactly what this would mean.
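One way to picture that middle ground (just my own formalization, not something your proposal actually specifies) is KL-regularized reward maximization, where a penalty toward the weak model’s initial policy controls how far its values can drift:

$$\pi_{\text{weak}} \approx \arg\max_{\pi}\; \mathbb{E}_{\pi}[R] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\text{weak}}^{\text{init}}\right)$$

A large β keeps the weak model close to its initial values, and β → 0 recovers the pure reward-maximizing equilibrium above; but nothing pins down where in between a given amount of training actually lands.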
EDIT: I’m not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.
Oops, I should have been clearer: 1.5 and 2.5 are not important parts of my version; the important part is updating the weaker model.
In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the strong model until the strong model’s advice maximizes its own reward, and it will follow that advice exactly.
You are completely correct that whatever reason we have to trust the weaker model is “doing a lot of the work.”
However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more “human-like,” while the model that generates ideas is more optimized and more “alien-like,” then the ideas the agent actually follows will resemble ideas humans would look at and say, “wow, that is brilliant, I could never have thought of that, and it works!” rather than ideas humans would look at and say, “what the heck is that? Huh, it works?! I could never have predicted that it would work.”
Furthermore, the “values” of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more “human-like.”
Given the same level of capability, it is safer.
These advantages are “automatic”: they only require the evaluator model to receive relatively less RL than the generator model. If you go one step further and use other alignment techniques with an efficient alignment tax on the evaluator model, things get even better, since the evaluator model gets even smarter while staying aligned.
Pure MONA is a special case of this generalized version, where the evaluator model gets exactly zero RL (at least in the context where it is advised by a stronger model). It is like setting everything to the point of maximum safety and maximum alignment tax.
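As a concrete (entirely made-up) illustration of the knob I have in mind: the asymmetry is just that the evaluator gets a much smaller RL budget than the generator, and pure MONA is the point where that budget is zero. The names and numbers below are hypothetical, only there to show the shape of the idea:

```python
from dataclasses import dataclass

@dataclass
class RLBudget:
    updates_per_iteration: int  # gradient steps this model receives per outer loop
    kl_coeff: float             # penalty for drifting from its pretrained policy

# Hypothetical numbers, only to show the asymmetry:
generator_budget = RLBudget(updates_per_iteration=64, kl_coeff=0.01)  # heavily optimized, more "alien-like"
evaluator_budget = RLBudget(updates_per_iteration=2,  kl_coeff=1.0)   # lightly optimized, stays "human-like"

# Pure MONA is the special case where the evaluator gets no RL at all:
mona_evaluator_budget = RLBudget(updates_per_iteration=0, kl_coeff=1.0)  # kl_coeff is irrelevant here
```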
With pure MONA, it is probably even safer for a given level of capability, but… can it reach the same level of capability?
Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:
- For tasks where it is obvious whether an idea/piece of advice is good, pure MONA might work just as well as the best RL models (e.g. o3).
- For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed; MONA’s generator model may be too sycophantic.
- For tasks where evaluating good ideas/advice is even harder than generating them, both MONA and my idea are worthless, because the generator model isn’t helping and the weaker evaluator model is essentially working on its own.
Yeah, if you’re keeping the evaluator aligned solely by not RLing it very much, that feels safer to me than RLing it a lot, but not a lot safer. Ideally there would be some more theoretically justified reason to expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you’re using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.
If additional alignment techniques act on both the generator and the evaluator, they may reduce capabilities too much due to their alignment tax. If they act only on the evaluator, the capabilities from the generator’s smart ideas remain, while the evaluator’s aligned final decisions control the whole agent.