My thoughts depend on whether you train the weaker model.
1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
2. If you aren’t training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be good; this can amplify capability because evaluation is often easier / cheaper than generation).
I think (1) is pretty fatal to the proposal, but (2) is just a heuristic; I could imagine, with more thought, concluding that it was actually a reasonable approach to take.
That said, it imposes a more substantial alignment tax, since you are now requiring that only the smaller model can be deployed as an agent (whereas MONA can in principle be applied to the most capable model you have).
Wait, can we try to imagine how point 1 happens?
If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless, since every output will get maximum reward.
Remember, the stronger model doesn’t care at all about the weaker model’s success, so if the weaker model rewards it for incomprehensible ideas, the stronger model won’t give it incomprehensible ideas that actually work, but incomprehensible ideas full of impressive-sounding nonsense.
The weaker model can only reward the stronger model for ideas it can understand, i.e. ideas it can predict will get a reward.
The crux is that the final reward signal only leaks into the stronger model through the weaker model’s prediction of that reward signal, mixed with the weaker model’s “values.” This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).
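To make this concrete, here is a minimal sketch of the reward routing I have in mind (Python pseudocode; every name, like `evaluate` or `decide`, is hypothetical rather than a real API):

```python
# Hypothetical sketch of the reward routing; none of these names are real APIs.

def route_rewards(state, idea, weak_model, env):
    # The strong model's reward is the weak model's *prediction* of how well the
    # idea will work, filtered through the weak model's own values.
    strong_reward = weak_model.evaluate(state, idea)

    # The weak model chooses what actually gets executed (it may follow the idea,
    # modify it, or refuse it), and the environment returns the true task reward.
    action = weak_model.decide(state, idea)
    next_state, task_reward, done = env.step(action)

    # Only the weak model is ever trained on task_reward; the strong model never
    # sees it directly.
    return strong_reward, task_reward, next_state, done
```

The point is just that `task_reward` never appears in the stronger model’s update; only `strong_reward` does, and that signal is bottlenecked by what the weaker model can understand and predict.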
It should have the same alignment tax as MONA, since in both my version and MONA, some kind of weaker model sets the stronger model’s reward signal. MONA’s agent directly uses the stronger model’s ideas, while my version lets the weaker model change the stronger model’s ideas (e.g. refuse unethical decisions, ask follow-up questions); but the weaker model doesn’t have to change them: it’s allowed to do exactly what MONA does.
The alignment tax is relatively lower thanks to the RL on the weaker model, though care must be taken to keep the weaker model trusted.
Do you agree?
I think I mostly understand what you’re saying. Essentially, MONA training is:
1. strong model comes up with a step
2. weak model outputs a score
3. the step gets executed
4. repeat, starting from the new state
Your idea adds “1.5. weak model optionally asks the strong model follow-up questions” and “2.5. weak model optionally changes the step.”
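In rough pseudocode (all of these names are made up just to pin down the loop, not a reference to any real implementation):

```python
# Rough pseudocode for the loop above; every method name here is invented.

def mona_style_episode(strong_model, weak_model, env, your_variant=False):
    state = env.reset()
    done = False
    while not done:
        idea = strong_model.propose(state)               # 1. strong model comes up with a step
        if your_variant:
            questions = weak_model.followups(state, idea)        # 1.5. weak model asks follow-up questions
            idea = strong_model.clarify(state, idea, questions)
        score = weak_model.score(state, idea)            # 2. weak model outputs a score
        step = weak_model.revise(state, idea) if your_variant else idea  # 2.5. weak model optionally changes the step
        next_state, task_reward, done = env.step(step)   # 3. the step gets executed
        strong_model.record(state, idea, reward=score)   # the strong model is trained on the score only
        if your_variant:
            weak_model.record(state, step, reward=task_reward)   # training the weak model on the outcome
        state = next_state                               # 4. repeat, starting from the new state
```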
If you don’t update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model’s questions are all preemptively answered, and which anticipates the weak model’s preferences enough that the weak model never feels the need to change the step. So this is essentially equivalent to MONA.
However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize reward, and the weak model does exactly the actions that the strong model suggests. (Technically the weak model isn’t even incentivized to score the strong model any particular way, but I’m just assuming it gives high scores for suggestions that it decides to carry out.)
I guess if you only train the weak model a small amount, it will do something in the middle, where it’s kind of motivated by long-term reward and kind of motivated by its initial values. There’s no guarantee on exactly what it does. I think the stipulation that the weak model remains safe “due to its architecture or whatever reason we trust it more” is doing a lot of work here; I’m not sure exactly what this would mean.
EDIT: I’m not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.
Oops, I should have been clearer: 1.5 and 2.5 are not important parts of my version; the important part is updating the weaker model.
In the limit where the weaker model is infinitely powerful and cares only about maximizing reward, you are correct: it will reward the stronger model until the stronger model’s advice maximizes reward, and it will follow that advice exactly.
You are completely correct that whatever reason we have to trust the weaker model is “doing a lot of the work.”
However, my version has the potential to achieve the same capability as a typical RL model (e.g. o3) while being safer. If the model that evaluates ideas is less optimized and more “human-like,” while the model that generates ideas is more optimized and more “alien-like,” then the ideas the system actually follows will resemble ideas humans will look at and say “wow, that is brilliant, I could’ve never thought of that, and it works!” rather than ideas humans will look at and say “what the heck is that? Huh, it works?! I could’ve never predicted that it would work.”
Furthermore, the “values” of the system will be less affected by RL, since the evaluator model has more control over the values, and it is relatively less optimized and more “human-like.”
Given the same level of capability, it is safer.
These advantages are “automatic”: they only require the evaluator model to have relatively less RL than the generator model. If you go one step further and use other alignment techniques with a low alignment tax on the evaluator model, it can get even better, since the evaluator model can get even smarter while staying aligned.
Pure MONA is a special case of this generalized version, where the evaluator model has exactly zero RL (at least in the context where it is advised by a stronger model). It is like setting everything to maximum safety and maximum alignment tax.
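In other words, the whole spectrum is a single dial (the numbers below are made up purely for illustration):

```python
# One hypothetical dial: how much outcome-based RL the evaluator model gets.
pure_mona  = {"evaluator_rl_steps": 0}      # evaluator never trained: maximum safety, maximum alignment tax
my_version = {"evaluator_rl_steps": 1_000}  # evaluator lightly trained: regains some capability, keeps most safety
# In both cases the generator model is heavily RL-trained against the evaluator's scores.
```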
With pure MONA, the system is probably even safer for a given level of capability, but… can it reach the same level of capability?
Capabilities depend on the difficulty of evaluating a good idea compared to generating a good idea:
For tasks where evaluating good ideas/advice is obvious, pure MONA might work just as well as the best RL models (e.g. o3).
For tasks where evaluating good ideas/advice is hard but still easier than generating them, my version might be needed. MONA’s generator model may be too sycophantic.
For tasks where evaluating good ideas/advice is even harder than generating them, both MONA and my idea are worthless, because the generator model isn’t helping and the weaker evaluator model is essentially working on its own.
Yeah, if you’re keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you’re using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
Yes, avoiding RL would be the best case scenario. I admit that my idea may be a sort of backup in case MONA falls short in capabilities.
If additional alignment techniques act on both the generator and the evaluator, they may reduce capabilities too much due to their alignment tax. If they only act on the evaluator, the capabilities from the generator’s smart ideas remain, while the evaluator’s aligned final decisions control the whole agent.
Thanks, and interesting generalization!