I think it is very straightforward to hurt a human individual, Steve, without piercing Steve’s Membrane. Just create and hurt minds that Steve cares about, but don’t tell him about it (in other words: ensure that there is zero effect on predictions of things inside the Membrane). If Bob knew Steve before the Membrane-enforcing AI was built, and Bob wants to hurt Steve, then Bob presumably knows Steve well enough to know what minds to create (in other words: there is no need for any form of access to any information that is within Steve’s Membrane). And if it is possible to build a Membrane-enforcing AI, it is presumably possible to build an AI that looks at Bob’s memories of Steve and creates some set of minds whose fate Steve would care about. This does not involve any form of blackmail or negotiation (and definitely nothing acausal). It is just Bob, who wants to hurt Steve and remembers things about Steve from before the first AI launch.
One can of course patch this. But I think there is a deeper issue in one specific case that I think is important: the case where the Membrane concept is supposed to protect Steve from a clever AI that wants to hurt Steve. Such an AI can think up things that humans cannot think up. In this case, patching all human-findable security holes in the Membrane concept will probably be worthless for Steve. It’s like trying to keep an AI in a box by patching all human-findable security holes. Even if it were known that all human-findable security holes had been fully patched, I don’t think that this changes things from Steve’s perspective if a clever AI tries to hurt him (whether the AI is inside a box, or Steve is inside a Membrane). This matters if the end goal is to build CEV. Specifically, it means that if CEV wants to hurt Steve, then the Membrane concept can’t help him.
Let’s consider a specific scenario. Someone builds a Membrane AI with all human-findable safety holes fully patched. Later, someone initiates an AI project whose ultimate goal is to build an AI that implements the Coherent Extrapolated Volition of Humanity. This project ends up successfully hitting the alignment target that it is aiming for. Let’s refer to the resulting AI as CEV.
One common aspect of human morality is often expressed in theological terms, along the lines of “heretics deserve eternal torture in hell”. A tiny group of fanatics with morality along these lines can end up completely dominating CEV. I have outlined a thought experiment where this happens to the most recently published version of CEV (PCEV). It is probably best to first read this comment, which clarifies some things talked about in the post that describes the thought experiment (including important clarifications regarding the topic of the post, the nature of the claims made, and the terminology used).
So, in this scenario, PCEV will try to implement some outcome along the lines of LP. Now the Membrane concept has to protect Steve from a very clever attacker that can presumably easily go around whatever patches were used to plug the human-findable safety holes. Against such an attacker, it’s difficult to see how a Membrane will ever offer Steve anything of value (similar to how it is difficult to see how putting PCEV in a human-constructed box would offer Steve anything of value).
I like the Membrane concept. But I think that the intuitions that seem to be motivating it should instead be used to find an alternative to CEV. In other words, I think the thing to aim for is an alignment target such that Steve can feel confident that the result of a successful project will not want to hurt Steve. One could for example use these underlying intuitions to explore alignment targets along the lines of MPCEV, mentioned in the above post (MPCEV is based on giving each individual meaningful influence regarding the adoption of those preferences that refer to her. The idea is that Steve needs to have meaningful influence regarding the decision of which Steve-preferences an AI will adopt). Doing so means that one must abandon the idea of building an AI that is describable as doing what a group wants (which in turn means that one must give up on CEV as an alignment target).
In the case of any AI that is describable as doing what a group wants, Steve has a serious problem (and this problem is present regardless of the details of the specific Group AI proposal). From Steve’s perspective, the core problem is that an arbitrarily defined abstract entity will adopt preferences that are about Steve. But if this is any version of CEV (or any other Group AI) directed at a large group, then Steve has had no meaningful influence regarding the adoption of those preferences that refer to Steve. Just like every other decision, the decision of what Steve-preferences to adopt is determined by the outcome of an arbitrarily defined mapping that maps large sets of human individuals into the space of entities that can be said to want things. Different sets of definitions lead to completely different such “Group entities”. These entities all want completely different things (changing one detail can for example change which tiny group of fanatics will end up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping “is wrong” (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object when that entity wants to hurt an individual. Since Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, there is no reason for him to think that this AI will want to help him, as opposed to wanting to hurt him. Combined with the vulnerability of a human individual to a clever AI that tries to hurt that individual as much as possible, this means that any Group AI would be worse than extinction, in expectation. Discovering that doing what humanity wants is bad for human individuals in expectation should not be particularly surprising. Groups and individuals are completely different types of things. So, this should be no more surprising than discovering that any reasonable way of extrapolating Dave will lead to the death of every single one of Dave’s cells.
One can of course give every individual meaningful influence regarding the adoption of those preferences that refer to her (as in MPCEV, mentioned in the linked post). So, Steve can be given this form of protection without giving Steve any form of special treatment. But this means that one has to abandon the core concept of CEV.
I like the Membrane concept on the intuition level. On that level, it sort of rhymes with the MPCEV idea of giving each individual meaningful influence regarding the adoption of those preferences that refer to her. I’m just noting that it does not actually protect Steve from an AI that already wants to hurt Steve. However, if the underlying intuition that seems to me to be motivating this work is instead used to look for alternative alignment targets, then I think it might be very useful for safety (by finding an alignment target such that a successful project would result in an AI that does not want to hurt Steve in the first place). So, I don’t think the Membrane concept can protect Steve from a successfully implemented CEV, in the unsurprising event that CEV will want to hurt Steve. But if CEV is dropped as an alignment target, and the underlying intuition behind this work is directed towards looking for alternative alignment targets, then I think the intuitions that seem to be motivating this work would fit very well with the proposed research effort described in the comment linked above.
(This is a comment about dangers related to successfully hitting a bad alignment target. It is for example not a comment about dangers related to a less powerful AI, or dangers related to projects that fail to hit an alignment target. These are very different types of dangers. So, my proposed idea of using the underlying intuitions to look for alternative alignment targets should be seen as complementary. It can be done in addition to looking for Membrane-related safety measures that can protect against other forms of AI dangers. In other words: if some scenario does not involve a clever AI that already wants to hurt Steve, then nothing I have said implies that the Membrane concept will be insufficient for protecting Steve. Using the Membrane concept as a basis for constructing safety measures might be useful in general. It will however not help Steve if a clever AI is actively trying to hurt Steve.)