I think the Go example really gets to the heart of why I think Debate doesn’t cut it.
The reason Go is hard is that it has a large game tree despite simple rules. When we treat the AIs' gameplay as information about the value of a state of the Go board, we know exactly what the rules are and how the game should be scored; the superhuman work the AIs are doing is in searching a game tree that's too big for us. The adversarial gameplay provides a check that the search through the game tree is actually finding high-scoring policies.
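To make the analogy concrete, here is a minimal sketch (my illustration, not from the comment) of why adversarial search certifies a state's value when the rules and scoring are fully known. The toy game is my own stand-in for Go: a pile of n tokens, players alternately take 1 or 2, and whoever takes the last token wins.

```python
def minimax(state):
    """Value of `state` = (tokens_left, player_to_move) under optimal play.

    Returns +1 if player +1 wins from here, -1 if player -1 wins. Because
    the legal moves and the score are exact, the adversarial search itself
    is the verification -- each side's best replies check the other's claims.
    """
    n, turn = state
    if n == 0:
        return -turn  # the player who just took the last token has won
    values = [minimax((n - m, -turn)) for m in (1, 2) if m <= n]
    # Each side picks the move best for itself: the adversarial check.
    return max(values) if turn == 1 else min(values)

print(minimax((3, 1)))  # -1: from a multiple of 3, the mover always loses
print(minimax((4, 1)))  # +1: take 1 and leave the opponent a multiple of 3
```

The point of the sketch is that nothing here depends on a judge's taste: the value of a position is fixed by the rules, and the search only fills in what we are too slow to compute ourselves. The worry in the comment is that moral argumentation has no such fixed scoring function.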
What does this framework need to apply to moral arguments? That humans “know the rules” of argumentation, that we can recognize good arguments when we see them, and that what we really need help with is searching the game tree of arguments to find high-scoring policies of argumentation.
This should immediately sound a little off. If humans have any exploits (or, phrased differently, if there are places where our meta-preferences and our behavior conflict), then this search process will try to find them. We can imagine trying to patch humans (e.g. giving them computer assistants), but this patching process has to already be the process of bringing human behavior in line with human meta-preferences! It's the patching process that's doing all the alignment work, reducing the Debate part to a fancy search for high-approval actions.
No, the dream of Debate is that it’s a game where human meta-preferences and behavior are already aligned. For all places where they diverge, the dream is that there’s some argument that will point this out and permanently fix it, and that this inconsistency-resolution process does not itself violate too many of our meta-preferences. That Debate is fair like Go is fair—each move is incremental, you can’t place a Go stone that changes the layout of the board to make it impossible for your opponent to win.
I think the Go example really gets to the heart of why I think Debate doesn’t cut it.
Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on physics and/or math questions, with the eventual goal of asking "what is a provably secure alignment proposal?"
Good question. There’s a big roadblock to your idea as stated, which is that asking something to define “alignment” is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up—could we then use Debate on the question “does this candidate match the verbal specification?”
I don't know—I think it still depends on how bad humans are as judges of arguments. We've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can imagine being convinced that it would work by seeing Debates play out with superhuman litigators, but since that's a very high bar, maybe I should apply more creativity to my expectations.
suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up—could we then use Debate on the question “does this candidate match the verbal specification?”
I’m less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them—which may be difficult, but hopefully less difficult with Debate involved.