My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this differs in procedures other than the one Rohin outlined.
Maybe the fix to the protocol is: Debater copy #1 is told “You go first. Pick an output y, and then argue for it.” Debater copy #2 is then told “You go second. Pick a different, conflicting output y2, and then argue against y and for y2.”
Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
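For concreteness, here's a rough sketch of what that training loop might look like. All of the interfaces here (`propose_and_argue`, `decide`, `reinforce`, `imitate`) are hypothetical stand-ins I made up for illustration, not any existing debate implementation:

```python
def debate_round(debater, judge, question):
    # Debater copy #1 goes first: picks an output y and argues for it.
    y, argument_for_y = debater.propose_and_argue(question, role="first")

    # Debater copy #2 goes second: picks a different, conflicting output y2,
    # arguing against y and for y2.
    y2, argument_for_y2 = debater.propose_and_argue(
        question, role="second", opponent_output=y
    )

    # The judge sees both arguments and picks a winner.
    first_wins = judge.decide(
        question, (y, argument_for_y), (y2, argument_for_y2)
    )
    return y, y2, first_wins


def training_step(debater, deployed_model, judge, question):
    y, y2, first_wins = debate_round(debater, judge, question)

    # The debater AI is trained only on probability-of-winning-the-debate:
    # whichever role won gets the reward.
    debater.reinforce(role="first", reward=1.0 if first_wins else 0.0)
    debater.reinforce(role="second", reward=0.0 if first_wins else 1.0)

    # The deployed model is trained to output whatever debater #1 would have
    # picked (i.e. imitate y), independent of who won the debate.
    deployed_model.imitate(question, target_output=y)
```

The point of splitting the two objectives is that only the debater ever optimizes for winning debates; the deployed model just imitates debater #1's choice of output.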