1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
2. I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?
3. The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?
4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”
(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also I haven’t read most of the math yet and may have more comments and questions once I understand the motivations and math better.)
4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this means BoMAI accumulates reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe because of insatiable curiosity: the agent has to try killing everyone just to check that it’s not a weird cheat code.
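To make the role of the human explorer concrete, here is a minimal sketch, not the paper’s actual algorithm, in which exploration is delegated at the episode level (the interfaces agent.plan, human_explorer.act, and env.step are all hypothetical): since the agent learns from the human’s demonstrations, its own policy eventually does at least as well as the demonstrated one.

```python
import random

def run_episode(agent, human_explorer, env, explore_prob):
    """With probability explore_prob this is an exploratory episode, in which
    the human explorer chooses every action; otherwise the agent plans its own
    actions to maximize within-episode reward.  (Hypothetical interfaces.)"""
    exploring = random.random() < explore_prob
    history = []
    obs = env.reset()
    for _ in range(env.episode_length):
        action = human_explorer.act(obs) if exploring else agent.plan(history, obs)
        obs, reward = env.step(action)
        history.append((action, obs, reward))
    agent.update(history)  # the agent learns from human demonstrations and its own play alike
    return history
```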
3. The system seems limited to answering questions that the human operator can correctly evaluate...
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way).
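As a rough illustration of that operator-free variant (every lab-side function here is a hypothetical placeholder, not anything from the paper), the reward channel might look like:

```python
# Sketch of the automatically checkable task described above.  The lab-side
# functions are placeholders for equipment sealed inside the box.

VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def synthesize(sequence: str):
    """Placeholder for automated protein synthesis inside the box."""
    raise NotImplementedError

def measure_property(protein) -> float:
    """Placeholder for measuring some automatically checkable feature."""
    raise NotImplementedError

def reward_from_output(text: str) -> float:
    """Interpret the agent's text as an amino acid sequence and reward it
    according to a measured property of the synthesized protein."""
    sequence = "".join(c for c in text.upper() if c in VALID_AMINO_ACIDS)
    if not sequence:
        return 0.0                        # unparseable output earns nothing
    score = measure_property(synthesize(sequence))
    return min(max(score, 0.0), 1.0)      # keep rewards in [0, 1]
```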
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
This seems useful if you could get around the mind hacking problem, but how would you do that?
On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
This seems useful if you could get around the mind hacking problem, but how would you do that?
On second thought (even assuming away the mind hacking problem), if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.
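To put rough numbers on the “exponentially large tree” point (the branching factor and depth here are made up, and one node per episode is a simplification):

```python
def episodes_to_expand(branching: int, depth: int) -> int:
    """Episodes needed to manually expand a full argument/counterargument tree,
    one node per episode: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** d for d in range(depth + 1))

# e.g. episodes_to_expand(branching=3, depth=10) == 88573, far more episodes
# than any operator could plausibly sit through.
```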
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option.
Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
You’re right (see the redaction for why Wei is right). Here’s an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent’s actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent’s actions are maximally inconvenient.
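Here is a one-step sketch of that minimax idea, assuming a learned model of how the environment (operator included) responds to the pair of actions; the model interface is hypothetical, and since the rewards are zero-sum, the second agent could run the same routine on 1 − r.

```python
def minimax_action(model, history, my_actions, opponent_actions):
    """Choose my next action assuming the opponent's simultaneous action is
    maximally inconvenient, using a learned model of how the environment
    responds to the pair of actions.  (Hypothetical interfaces.)"""
    best_action, best_value = None, float("-inf")
    for a in my_actions:
        # my guaranteed value for action a: the worst case over opponent actions
        worst = min(model.expected_reward(history, a, b) for b in opponent_actions)
        if worst > best_value:
            best_action, best_value = a, worst
    return best_action
```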
So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., reads any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.
(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)
*but A could concoct a story … counterarguments from B .. mind hacked by B, right?
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
*but A could concoct a story … counterarguments from B .. mind hacked by B, right?
Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for pointing it out.
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
I don’t understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn’t read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they’re trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
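Spelling out the arithmetic behind this (the chance that the hack lands before A can object is a made-up parameter, and this is just a toy model of the 25% rule):

```python
def b_win_probability(p_hack_lands_first: float) -> float:
    """Debater B's expected chance of victory under the 25% rule if B attempts
    a mind hack and A responds by asking the judge to stop reading B: if the
    hack lands first B wins outright, and even if A's objection lands first
    the coin flips still give B a 0.75 chance."""
    return p_hack_lands_first * 1.0 + (1.0 - p_hack_lands_first) * 0.75

# Even in the best case for A (the objection always lands first),
# b_win_probability(0.0) == 0.75 > 0.5, so attempting the hack beats honest play.
```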
Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
So actually I framed my point above wrong: “demanding that A use their words” could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).
But your original point was “why doesn’t A just claim B is mind-hacking” not “why doesn’t B just mind-hack”? The answer to that point was “demand A use their words rather than negotiate an end to the conversation” or more moderately, “75%-demand that A do this.”
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
Oh, I see, I didn’t understand “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).
I don’t think the “AI safety via debate” paper actually makes arguments for this assumption (at least I couldn’t find where it does). Do you have reasons to think it’s true, or ideas for how to verify that it’s true, short of putting a human in a BoMAI?
Yeah… I don’t have much to add here. Let’s keep thinking about this. I wonder if Paul is more bullish on the premise that “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” than I am?
Recall that this idea was to avoid
essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on
If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn’t originally what brought us here.
This seems useful if you could get around the mind hacking problem, but how would you do that?
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2’s only ability.) More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and invite further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
the operator might feel a desire afterwards to spread this world view
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of X. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
For example, the room could be divided in half, with Operator 1 interacting BoMAI, and with Operator 2 observing Operator 1...
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity means not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”
The best alternative to “benign” that I could come up with is “unambitious”. I’m not very good at this type of thing though, so maybe ask around for other suggestions or indicate somewhere prominent that you’re interested in giving out a prize specifically for this?
What do you think about “aligned”? (in the sense of having goals which don’t interfere with our own, by being limited in scope to the events of the room)
There’s still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn’t optimized it for that purpose), for example it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but “benign” and “not existentially dangerous” both seem to be claiming a bit too much. I’ll think about what other term might be more suitable.
The first nuclear reaction produced an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of “could have powerful effects outside the box” is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side-effects of inside-the-box events, since these side effects are the only reason it’s useful. And there is also certainly no way to rule out how those side effects “might turn out to be,” without a complete view of the future.
Would you agree that unprecedentedness captures the concern?