ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.