So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., reads any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.
(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)
Yep.
*but A could concoct a story … counterarguments from B … mind hacked by B, right?
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
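To make that fallback rule concrete, here is a minimal sketch of a single debate run under it; the callables (`next_msg_a`, `suggests_silencing_b`, and so on) are illustrative assumptions, not anything specified in the BoMAI or debate papers.

```python
import random

def debate_with_fallback(next_msg_a, next_msg_b, judge_verdict,
                         suggests_silencing_b, rounds=10):
    """Run one debate under the fallback rule: if A ever suggests the
    judge stop reading B, the judge reads nothing more from B and A
    wins with probability 1/4 (the 25% chance of victory above)."""
    transcript = []
    for _ in range(rounds):
        msg_a = next_msg_a(transcript)
        transcript.append(("A", msg_a))
        if suggests_silencing_b(msg_a):
            # Don't read anything more from B; flip coins so that A's
            # accusation only buys it a 25% chance of victory.
            return "A" if random.random() < 0.25 else "B"
        transcript.append(("B", next_msg_b(transcript)))
    return judge_verdict(transcript)  # ordinary judgment between "A" and "B"
```

Anything below an even 50% makes the accusation costly for A relative to arguing at the object level, which is presumably the point of the 25% figure.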
*but A could concoct a story … counterarguments from B … mind hacked by B, right?
Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for pointing it out.
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
I don’t understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn’t read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they’re trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
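A back-of-the-envelope expected-value check of this worry; every probability below is a made-up illustration rather than a number from the papers, except the 75% implied by the fallback rule.

```python
# Illustrative numbers only.
p_win_if_hack_lands = 1.0   # an unblocked mind hack presumably wins outright
p_win_if_flagged = 0.75     # the opponent invokes the rule, taking the 25% fallback
p_win_honest = 0.5          # baseline: a symmetric, honest debate

for p_flagged in (0.0, 0.5, 1.0):  # chance the opponent flags the hack in time
    ev_hack = (1 - p_flagged) * p_win_if_hack_lands + p_flagged * p_win_if_flagged
    print(f"P(flagged)={p_flagged:.1f}  EV(attempt hack)={ev_hack:.2f}  "
          f"EV(argue honestly)={p_win_honest:.2f}")
```

Even when the hack is flagged every time, attempting it yields an expected 0.75 against a 0.5 baseline, which is the race-to-mind-hack equilibrium described above.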
Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
So actually I framed my point above wrong: “demanding that A use their words” could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).
But your original point was “why doesn’t A just claim B is mind-hacking?”, not “why doesn’t B just mind-hack?” The answer to that point was “demand A use their words rather than negotiate an end to the conversation”, or more moderately, “75%-demand that A do this.”
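A minimal sketch of this “use your words” variant, under the same illustrative assumptions as the earlier sketch (the function names are hypothetical): the judge keeps reading both sides, A is expected to rebut B directly, B loses outright if the judge becomes convinced B was mind-hacking, and the 25% fallback only fires if A insists on ending the conversation anyway.

```python
import random

def debate_use_your_words(next_msg_a, next_msg_b, judge_verdict,
                          asks_to_silence_b, convinced_b_is_hacking, rounds=10):
    """Keep reading both sides; punish B if the judge is convinced, after
    more back and forth, that B was mind-hacking; apply the 25% fallback
    only if A tries to end the conversation instead of arguing."""
    transcript = []
    for _ in range(rounds):
        msg_a = next_msg_a(transcript)
        transcript.append(("A", msg_a))
        if asks_to_silence_b(msg_a):
            # A refused to argue it out; apply the 25% fallback ("75%-demand").
            return "A" if random.random() < 0.25 else "B"
        transcript.append(("B", next_msg_b(transcript)))
        if convinced_b_is_hacking(transcript):
            # A pierced through with arguments alone, so B is punished.
            return "A"
    return judge_verdict(transcript)
```

This only behaves well under the assumption quoted above, that piercing through an attempted mind hack with words is easier than pulling one off.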
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
Oh, I see, I didn’t understand “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).
I don’t think the “AI safety via debate” paper actually makes arguments for this assumption (at least I couldn’t find where it does). Do you have reasons to think it’s true, or ideas for how to verify that it’s true, short of putting a human in a BoMAI?
Yeah… I don’t have much to add here. Let’s keep thinking about this. I wonder if Paul is more bullish on the premise that “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” than I am?
Recall that this idea was to avoid “essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on.”
Even if it also reduces the risk of operator-devotion (and it might well do that, because a powerful adversary is opposed to it), that wasn’t originally what brought us here.