I’m mentally substituting continue_t for some question more like “should this debate continue?”, because I think the setup you describe keeps going until Amp is satisfied with an answer, which might be never for a weak M. It’s also not obvious to me that the reward system you describe actually teaches agents to debate between odd and even steps: if there’s a right answer that the judge might be convinced of, I think M will be trained to give it regardless of step parity, because that’s what gets rewarded.
Really, the state of the debate feels more like the hidden state of an RNN, and you’re going to end up training something that can use that state to do a good job of ending debates and making the human response match the model response.
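To make the parity point concrete, here is a minimal sketch of the loop as I read your setup. All names here (M, amp, judge, run_debate) are hypothetical stand-ins for the components in your sketch, not a real API:

```python
# Hypothetical sketch of the debate loop as I understand it.
# M, amp, and judge are stand-in callables, not a real implementation.
def run_debate(M, amp, judge, question, max_steps=100):
    transcript = []  # plays the role of the RNN-like hidden state
    for t in range(max_steps):
        # The same policy M answers on odd and even steps alike.
        answer = M(question, transcript)
        transcript.append(answer)
        # Reward hinges only on whether the judge is convinced,
        # not on the parity of t, so M is pushed toward the
        # convincing answer at every step.
        if judge(question, transcript):
            return answer, t
        # amp answering "should this debate continue?"; a weak M
        # may never satisfy it, so the loop is capped at max_steps.
        if not amp(question, transcript):
            break
    return None, max_steps
```

Nothing in this loop distinguishes the two debaters by step index, which is why I expect M to converge on the judge-convincing answer at every step rather than learning to argue opposing sides.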
I really love the level of detail in this sketch!