I was trying to get a clearer picture of how training works in debate so I wrote out the following. It is my guess based on reading the paper, so parts of it could be incorrect (corrections are welcome!), but perhaps it could be helpful to others.
My question was: is the training process model-free or model-based? After looking into it more and writing this up, I’m convinced it’s model-based, but I think maybe either could work? (I’d be interested if anyone has a take on that.)
In the model-free case, I think it would not be trained like AlphaGo Zero, but instead with something like PPO. In the model-based case, it would be more similar to AlphaGo Zero: training would use Monte Carlo tree search, with debate serving as a policy improvement operator, which would make it IDA. Or does it not matter? (N.b. I’m using model-free and model-based in the RL sense here, where “model” is not the ML model but rather a model of the game that allows the network to simulate the game in its mind.)
More details on the approaches:
Model-free — During training, the network gets reward for winning the game, and a policy gradient algorithm (e.g. PPO) updates it to take winning moves more often in the future. A minimal sketch of what that loop might look like is below.
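Here is that sketch of the model-free case, assuming a hypothetical DebateEnv that plays out a full debate against a copy of the policy and a judge that returns +1/−1 at the end. All names and interfaces here are my own stand-ins, not from the paper:

```python
import torch

def reinforce_update(policy, optimizer, env, num_episodes=32):
    """One round of model-free training on the debate game (REINFORCE-style).

    `policy`, `env`, and the judge are hypothetical stand-ins; PPO would add
    clipping, a value baseline, etc. on top of this basic policy gradient.
    """
    for _ in range(num_episodes):
        log_probs = []                    # log-probs of the moves our agent chose
        state, done = env.reset(), False  # state = question + transcript so far
        while not done:
            utterance, log_prob = policy.sample(state)  # our agent's next utterance
            log_probs.append(log_prob)
            state, done = env.step(utterance)  # opponent (a copy of the policy) also moves
        reward = env.judge()  # +1 if the judge picks our agent's answer, -1 otherwise
        # Push up the log-probs of moves from winning debates, push down losing ones.
        loss = -reward * torch.stack(log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```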
Model-based — During training, the network’s policy head is trained to better predict the result of the amplification process, i.e. what move it would make after simulating the debate in its mind. Edit: I’m not sure how you would compute the distance between two possible utterances to constitute the loss, though. Maybe something like what is used in RLHF fine-tuning for LLMs, but I’m not familiar with that.
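One way the loss could work (this is a guess by analogy with AlphaGo-Zero-style distillation and ordinary language-model training, not something the paper specifies) is to treat each utterance as a token sequence and use cross-entropy between the policy’s next-token distribution and the utterance that the simulated debate / search settles on:

```python
import torch.nn.functional as F

def distillation_loss(policy, question, transcript, target_tokens):
    """Distill the amplified move back into the policy (AlphaGo-Zero-style).

    `target_tokens` is the tokenized utterance chosen by the simulated debate /
    search; the policy interface here is a hypothetical guess, not from the paper.
    """
    # Assumed interface: the policy returns one logit vector over the vocabulary
    # for each position of the target utterance (standard teacher forcing).
    logits = policy(question, transcript, target_tokens)  # (seq_len, vocab_size)
    # Token-level cross-entropy plays the role of the "distance between two
    # utterances": it pushes the policy toward the move the amplified player made.
    return F.cross_entropy(logits, target_tokens)
```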
Both — In both cases, [my guess is that] the network is outputting arbitrary utterances, which could be position-taking sentences or argumentative sentences.
The relevant paper sections I found on this are:
“...we propose training agents via self play on a zero sum debate game.” AND “We can approximate optimal play by training ML systems via self play, which has shown impressive performance in games such as Go, chess, shogi, and Dota 2 [Silver et al., 2016, 2017a,b, OpenAI, 2017].” AND “Similarly, the deep networks used in Silver et al. [2017b] are convolutional residual networks unrelated to the game tree of Go, though the training process does involve the tree via MCTS.”
This strongly implies the model-based approach, given the self play and MCTS. But maybe either approach can be viewed as self play? (In the model-free case, it would be playing against a copy of itself.)
“The equivalence is far from exact: the feedback for a debate is about the whole game and the feedback for amplification is per step, debate as presented uses reinforcement learning while the easiest versions of amplification use supervised learning, and so on. However all these features can be adjusted in either direction.”
I’m not exactly sure, but maybe this is saying that either approach could work?
“In contrast to a legal argument or a typical competitive debate, the two players in this game are allowed to choose what they are arguing for, including both arguing for the same thing.”
Hence the “arbitrary utterances” thing above.
“At test time it suffices to stop after step 2: we do not need to run the debate (though agents could simulate debates at test time to strengthen answers).”
The parenthetical implies that the model-based approach would be used. However, under both approaches I think it would be valid not to run the debate at test time: whatever “opening move” the network makes would be its stance on the proposition (e.g. for the “Where should we go on vacation” question, its first utterance would likely be something like “Aruba”). A toy illustration is below.
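Here is that toy illustration of the parenthetical (all function names and interfaces are hypothetical): at test time you could either return the opening move directly, or spend extra compute simulating debates for a few candidate answers and keep the one that survives most often.

```python
def answer(policy, judge, simulate_debate, question, use_debate=False,
           n_candidates=4, n_sims=8):
    """Answer a question, optionally strengthening it with simulated debates.

    `policy(question, transcript)` proposes an utterance, `simulate_debate(...)`
    plays out a full debate from a given opening move, and `judge(transcript)`
    returns 1 if the opening side wins; all are hypothetical stand-ins.
    """
    if not use_debate:
        # "Stop after step 2": the opening move is the answer (e.g. "Aruba").
        return policy(question, transcript=[])

    candidates = [policy(question, transcript=[]) for _ in range(n_candidates)]

    def win_rate(candidate):
        transcripts = [simulate_debate(policy, question, opening_move=candidate)
                       for _ in range(n_sims)]
        return sum(judge(t) for t in transcripts) / n_sims

    # Keep the candidate answer that wins its simulated debates most often.
    return max(candidates, key=win_rate)
```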