This seems useful if you could get around the mind hacking problem, but how would you do that?
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2′s only ability). More importantly than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
the operator might feel a desire afterwards to spread this world view
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let XX be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of XX. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a Superintelligence setting out to accomplish XX would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
For example, the room could be divided in half, with Operator 1 interacting BoMAI, and with Operator 2 observing Operator 1...
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2′s only ability). More importantly than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let XX be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of XX. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a Superintelligence setting out to accomplish XX would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Can you explain this?
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.