“Winning the debate” is a leaky abstraction. There are many possible ways the debate could be messed with.
The hope of debate, as I see it, is that if the AIs are playing a perfectly zero-sum game, then for any particular edge case, at least one of the AIs doesn’t like it. If one AI can modulate its RAM to produce cell-phone signals that hack into a plane and crash it into the button, then the other AI can produce signals (maybe radio noise) that block this from happening. If one AI can brainwash the human, the other AI can protect the human.
It seems at least plausible that all the really weird and scary edge cases are easier to avoid than to reach, in the sense that random noise doesn’t produce these conditions. Of course, this depends on the context, and it might well not be true.
If you avoid all the weird and scary edge cases, you might be working in a domain where the notion of “winning the debate” is a good abstraction. If, within this domain, the AI’s utility function is something that you control (like whether a button is pressed), then you might get some sort of debate.
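To make the zero-sum intuition concrete, here is a minimal sketch (hypothetical names, not from any actual debate implementation), just illustrating that when the two rewards are forced to sum to zero, any edge case that pushes one agent’s reward up pushes the other’s down, so the other agent is incentivized to block it:

```python
# Minimal sketch of a perfectly zero-sum debate reward (hypothetical names).
# Whatever outcome agent 1 engineers to raise its reward, agent 2's reward
# drops by the same amount, so agent 2 is incentivized to prevent it.

def debate_rewards(judge_picks_agent_1: bool) -> tuple[float, float]:
    """Rewards for (agent_1, agent_2); by construction they sum to zero."""
    r1 = 1.0 if judge_picks_agent_1 else -1.0
    return r1, -r1

assert sum(debate_rewards(True)) == 0.0
assert sum(debate_rewards(False)) == 0.0
```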
This approach works with proxy-aligned mesa-optimisers. If you are using reinforcement learning, there is no way to distinguish the goals “make sure that a finger touches this button” and “make sure that electricity flows under this button”, assuming these are perfectly correlated during training.
Debate could work with either proxy, so long as both debating AIs use the same proxy.
If they use different proxies, then they can work together to persuade the human to cut the wiring and then press their finger to the button, and both count that as a win.
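A toy sketch of this failure mode (the setup and names are made up for illustration): say agent 1’s learned proxy for “the button is pressed” is “a finger touches the button”, while agent 2’s learned proxy for “the button is not pressed” is “no electricity flows under the button”. On the training distribution these are perfect complements, but off-distribution both can be satisfied at once:

```python
# Toy sketch of the proxy-collusion failure mode (hypothetical setup).
# Intended game: zero-sum -- agent 1 wins iff the button registers a press,
# agent 2 wins iff it does not. Each agent has learned a different proxy.

def proxy_agent_1(state: dict) -> bool:
    # Learned proxy for "button pressed": a finger touches the button.
    return state["finger_on_button"]

def proxy_agent_2(state: dict) -> bool:
    # Learned proxy for "button not pressed": no electricity flows under it.
    return not (state["wiring_intact"] and state["finger_on_button"])

# On the training distribution the wiring is always intact, so the proxies
# are perfect complements: exactly one agent counts any episode as a win.
for pressed in (True, False):
    train_state = {"wiring_intact": True, "finger_on_button": pressed}
    assert proxy_agent_1(train_state) != proxy_agent_2(train_state)

# Off-distribution: persuade the human to cut the wiring, then press the
# button. Both proxies are satisfied, so both agents count it as a win and
# the game is no longer zero-sum from their perspective.
collusion_state = {"wiring_intact": False, "finger_on_button": True}
assert proxy_agent_1(collusion_state) and proxy_agent_2(collusion_state)
```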
I agree with what Paul and Donald are saying, but the post was trying to make a different point.
Among various things needed to “make debate work”, I see three separate sub-problems:
(A) Ensuring that “agents use words to get a human to select them as the winner, and this is their only terminal goal” is a good abstraction. (Please read this in the intended meaning of the sentence. No, if there is a magical word that causes the human’s head to explode and their body to fall onto the reward button, this doesn’t count.)
(B) Having already accomplished (A), ensure that “agents use words to convince the human that their answer is better” is a good abstraction. (Not sure how to operationalize this, but you want to, for example, ensure that: (i) Agents do not collaboratively convince the human to give reward to both of them. (ii) If the human could in principle be brainwashed, the other agent will be able and incentivized to prevent this. In particular, no brainwashing in a single claim.)
(C) Having already accomplished (A) and (B), ensure that AIs in debate only convince us of safe and useful things.
While somewhat related, I think these three problems should be tackled separately as much as possible. Indeed, (A) does not really seem to be specific to debate, since a similar problem can be posed for any advanced AI. Moreover, I think that if you are relying on the presence of the other agent to help you with (A) (e.g., one AI producing signals to block the other AI’s signals), you have already made a mistake.
On the other hand, it seems fine to rely on the presence of the other agent for both (B) and (C). However, my intuition is that these problems are mostly orthogonal: most solutions to (B) will be compatible with most solutions to (C).
For (A), Michael Cohen’s Boxed Myopic AI seems like a particularly relevant thing. (Not saying that what it proposes is enough, nor that it is required in all scenarios.)
Michael’s recent “AI Debate” Debate post seems to be primarily concerned about (B).
Finally, this post could be rephrased as “When people talk about debate, they often focus on (C). And that seems fair. However, if you make debate non-zero-sum, your (B) will break.”
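As a minimal sketch of that last sentence (with made-up payoff numbers): under a zero-sum reward there is no joint strategy that leaves both agents better off than competing, whereas once the reward is non-zero-sum, “collaboratively convince the human to reward both of us” can beat honest competition for both agents at once, which is exactly the (B)(i) failure:

```python
# Minimal sketch with made-up payoffs: why (B) leans on the zero-sum structure.
# Each entry maps a joint strategy to (agent_1_reward, agent_2_reward).

ZERO_SUM = {
    ("compete", "compete"): (0.5, -0.5),  # expected payoffs of honest debate
    ("collude", "collude"): (0.0, 0.0),   # rewards still forced to sum to zero
}
NON_ZERO_SUM = {
    ("compete", "compete"): (0.5, -0.5),
    ("collude", "collude"): (1.0, 1.0),   # human is talked into rewarding both
}

def collusion_pays(payoffs: dict) -> bool:
    """Does joint collusion give *both* agents more than competing does?"""
    c1, c2 = payoffs[("collude", "collude")]
    k1, k2 = payoffs[("compete", "compete")]
    return c1 > k1 and c2 > k2

assert not collusion_pays(ZERO_SUM)   # zero-sum: no deviation helps both
assert collusion_pays(NON_ZERO_SUM)   # non-zero-sum: collusion dominates
```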