I think I mostly understand what you’re saying. Essentially, MONA training is:
1. strong model comes up with a step
2. weak model outputs a score
3. the step gets executed
4. repeat, starting from the new state
Your idea adds “1.5. weak model optionally asks the strong model follow-up questions” and “2.5. weak model optionally changes the step.”
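To make sure I’m picturing the same loop, here’s a rough sketch of it in Python. All of the interfaces here (`propose_step`, `maybe_ask`, `score`, `maybe_revise`, `execute`, `update`) are names I’m making up for this comment, not anything from the MONA paper:

```python
def extended_mona_episode(strong_model, weak_model, env, max_steps=10):
    """One episode of the MONA loop, with the proposed 1.5 and 2.5 additions."""
    state = env.reset()
    for _ in range(max_steps):
        # 1. strong model comes up with a step
        step = strong_model.propose_step(state)

        # 1.5 (your addition): weak model optionally asks follow-up questions
        question = weak_model.maybe_ask(state, step)
        while question is not None:
            answer = strong_model.answer(state, step, question)
            step = step.with_clarification(question, answer)
            question = weak_model.maybe_ask(state, step)

        # 2. weak model outputs a score; under MONA this per-step score is
        #    the only reward signal the strong model is trained on
        score = weak_model.score(state, step)

        # 2.5 (your addition): weak model optionally changes the step
        step = weak_model.maybe_revise(state, step)

        # 3. the step gets executed
        state = env.execute(step)

        # Only the strong model is updated, and only on the weak model's
        # myopic score (not on any downstream outcome)
        strong_model.update(state, step, reward=score)

    # 4. repeat, starting from the new state -- handled by the loop above
```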
If you don’t update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step that explains itself well enough that the weak model’s questions are all preemptively answered, and that anticipates the weak model’s preferences well enough that the weak model never feels the need to change it. So this is essentially equivalent to MONA.
However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize long-term reward, and the weak model carries out exactly the actions the strong model suggests, so jointly the two models are just an ordinary agent optimizing long-term reward end to end. (Technically the weak model isn’t even incentivized to score the strong model any particular way, but I’m just assuming it gives high scores to suggestions it decides to carry out.)
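Concretely, the difference I’m worried about is which reward each model’s update sees; something like this (same caveat: the names are made up, and I’m assuming “update the weak model” means training it on the final outcome reward):

```python
def apply_updates(strong_model, weak_model, trajectory, final_reward,
                  train_weak_model=False):
    """Sketch of the two regimes; trajectory is a list of (state, step, score)."""
    for state, step, score in trajectory:
        # MONA-style: the strong model only ever sees the weak model's
        # immediate score for the step it proposed.
        strong_model.update(state, step, reward=score)

    if train_weak_model:
        # The regime I find problematic: the weak model is trained on the
        # final outcome reward, so "strong proposes, weak executes" jointly
        # becomes an ordinary end-to-end optimizer of long-term reward.
        for state, step, _ in trajectory:
            weak_model.update(state, step, reward=final_reward)
```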
I guess if you only train the weak model a small amount, it will end up somewhere in the middle, partly motivated by long-term reward and partly by its initial values, with no guarantee on exactly what it does. I think the stipulation that the weak model remains safe “due to its architecture or whatever reason we trust it more” is doing a lot of work here; I’m not sure exactly what that would mean.
Yeah, if you’re keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be a more theoretically justified reason to expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you’re using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.