The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability,
How is this true, if the debaters don’t get to choose which output they are arguing for? Aren’t they instead incentivized to say that whatever output they are assigned is the best?
Yeah my bad, that’s incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
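To make the incentive difference concrete, here is a minimal sketch of that alternative protocol. The `judge_win_prob` table is a made-up assumption standing in for how often the judge sides with the debater defending each output; the function names are hypothetical too and do not come from the discussion.

```python
# Hypothetical numbers: probability that the judge sides with the debater
# defending each output after both sides have argued. Assumed, not measured.
judge_win_prob = {"output_a": 0.7, "output_b": 0.3}


def first_debater_choice(win_prob):
    """A reward-maximizing first debater defends the output it expects to win with."""
    return max(win_prob, key=win_prob.get)


def assigned_debater_reward(win_prob, assigned_output):
    """With assigned sides there is no choice to make: the expected reward
    is fixed by whichever output the debater was handed."""
    return win_prob[assigned_output]


choice = first_debater_choice(judge_win_prob)
```

If the judge tends to side with the better output, the choosing debater's best move is to pick that output, whereas the assigned debater has no decision that could reveal which output it thinks is better.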
My guess is they do kinda choose: in training, it’s less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: this may not hold for procedures other than the one Rohin outlined.
Maybe the fix to the protocol is: Debater copy #1 is told “You go first. Pick an output y, and then argue for it.” Debater copy #2 is then told “You go second. Pick a different, conflicting output y2, and then argue against y and for y2.”
Then the debater AI is simply trained on probability-of-winning-the-debate, but the original AI to be deployed is trained on probability-its-output-would-have-been-picked-by-debater-1. (Or, trained to output whatever debater-1 would have picked.)
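A rough sketch of the two training signals in this proposed fix, under loud assumptions: the `quality` scores, `pick_best`, and `judge` are toy stand-ins invented for illustration, and a single 0/1 outcome is used in place of a probability estimated over many debates.

```python
# Toy stand-ins (all hypothetical): a hidden quality score per candidate output.
quality = {"a": 0.9, "b": 0.4, "c": 0.1}


def pick_best(candidates, exclude=None):
    """Debater policy stub: defend the highest-quality output still available."""
    return max((c for c in candidates if c != exclude), key=quality.get)


def judge(y1, y2):
    """Judge stub: sides with the higher-quality output after the debate."""
    return y1 if quality[y1] >= quality[y2] else y2


def training_signals(candidates, debater_pick, judge_fn, policy_output):
    # Debater copy #1 goes first and picks the output y it will argue for.
    y1 = debater_pick(candidates, exclude=None)
    # Debater copy #2 must pick a different, conflicting output y2.
    y2 = debater_pick(candidates, exclude=y1)
    winner = 1 if judge_fn(y1, y2) == y1 else 2
    # Debater signal: whether each copy won the debate.
    debater_rewards = {1: float(winner == 1), 2: float(winner == 2)}
    # Deployed-policy signal: whether its output matches debater 1's pick.
    policy_reward = float(policy_output == y1)
    return y1, y2, debater_rewards, policy_reward


y1, y2, debater_rewards, policy_reward = training_signals(
    list(quality), pick_best, judge, policy_output="a"
)
```

The point of separating the two signals is that the debater is only ever rewarded for winning, while the deployed model is rewarded for agreeing with whatever the first debater chose to defend.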