This seems useful if you could get around the mind hacking problem, but how would you do that?
On second thought (and even assuming away the mind hacking problem), if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes aren’t enough to determine whether the first answer was a good one, because the second answer is also optimized for sounding good rather than for being correct, so you’d have to run another episode to ask for a counterargument to the second answer, and so on; and once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat the whole process.) The point of “AI Safety via Debate” was to let the AI do all of this searching for you, so it seems you’d still have to figure out how to do something similar to avoid the exponential search.
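To make the combinatorics concrete, here’s a rough sketch (my own illustration, with made-up branching factor and depth) of how many separate episodes that manual search costs, versus the single line of argument a debate is supposed to surface:

```python
# Illustrative only: how many episodes a human would need to adjudicate an
# argument tree by repeatedly asking "what's wrong with this answer?", one
# episode at a time, compared with reading a single debate path.

def episodes_to_adjudicate(branching: int, depth: int) -> int:
    """Every claim has to be checked against each of its possible rebuttals,
    so the whole tree gets visited: roughly branching**depth episodes."""
    if depth == 0:
        return 1  # a leaf claim the human can evaluate directly
    # one episode to elicit the claim, plus full adjudication of each rebuttal
    return 1 + branching * episodes_to_adjudicate(branching, depth - 1)

def debate_path_length(branching: int, depth: int) -> int:
    """In debate, the AIs (ideally) pick the single most relevant
    counterargument at each step, so the human only follows one path."""
    return depth + 1

if __name__ == "__main__":
    b, d = 3, 6  # made up: 3 plausible rebuttals per claim, 6 levels deep
    print(episodes_to_adjudicate(b, d))  # 1093 episodes by hand
    print(debate_path_length(b, d))      # 7 steps in a single debate
```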
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option. Do you have thoughts on this?
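To sketch one way the weighting could work (a toy illustration of my own; the option names and the 0.95 threshold are made up, not from either paper), the judge could require a high post-debate credence before the riskier option wins, with anything short of that defaulting to the safe side:

```python
# Toy sketch of an asymmetric judge rule: the riskier option only wins if
# the judge is highly confident after hearing the debate. The 0.95 threshold
# is a made-up parameter; 0.5 would recover an ordinary, unweighted debate.

SAFE_OPTION = "keep the AI boxed"
RISKY_OPTION = "release the AI"
RELEASE_THRESHOLD = 0.95

def judge_decision(credence_release_is_safe: float) -> str:
    """Pick an option from the judge's post-debate credence that releasing
    is safe; ties and residual uncertainty default to the safe side."""
    if credence_release_is_safe >= RELEASE_THRESHOLD:
        return RISKY_OPTION
    return SAFE_OPTION

print(judge_decision(0.80))  # keep the AI boxed
print(judge_decision(0.97))  # release the AI
```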
you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than at dismissing good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.