Maybe the fix to the protocol is: Debater copy #1 is told “You go first. Pick an output y, and then argue for it.” Debater copy #2 is then told “You go second. Pick a different, conflicting output y2, and then argue against y and for y2.”
Then the debater AI is simply trained on probability-of-winning-the-debate, while the original AI to be deployed is trained on the probability that its output would have been picked by debater 1. (Or, equivalently, trained to output whatever debater 1 would have picked.)
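To make the proposed split between the two training signals concrete, here is a minimal toy sketch of the protocol. Everything in it is hypothetical illustration (the output set, the random stand-ins for the debater policy and the judge); it just shows which quantity rewards the debater and which quantity the deployed model is trained toward.

```python
import random

# Toy outputs the debaters can choose between (hypothetical).
OUTPUTS = ["y_a", "y_b", "y_c"]

def debater_pick(excluded=None):
    """Stand-in for a debater policy. Debater 2 must pick a
    different, conflicting output, hence the exclusion."""
    choices = [y for y in OUTPUTS if y != excluded]
    return random.choice(choices)

def judge(y1, y2):
    """Stand-in for the judge: returns whichever output wins the debate."""
    return random.choice([y1, y2])

def run_debate():
    y1 = debater_pick()                # debater copy #1 goes first, argues for y1
    y2 = debater_pick(excluded=y1)     # debater copy #2 picks a conflicting y2
    winner = judge(y1, y2)
    # Signal for the debater AI: did this copy win the debate?
    debater_reward = {1: winner == y1, 2: winner == y2}
    # Signal for the deployed AI: imitate whatever debater 1 picked,
    # independent of who won the debate.
    deployed_target = y1
    return y1, y2, winner, debater_reward, deployed_target

y1, y2, winner, reward, target = run_debate()
assert y1 != y2 and winner in (y1, y2) and target == y1
```

The key point the sketch makes explicit is that `deployed_target` is debater 1's pick, not the debate winner, so the deployed model's objective never directly depends on the judge.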