I think that these two proposed constraints will indeed remove some bad outcomes. But I don’t think that they will help in the thought experiment outlined in the post. These fanatics want all heretics in existence to be punished. This is a normative convention. It is a central aspect of their morality. An AI that deviates from this ethical imperative is seen as an unethical AI. Deleting all heretics from the memory of the fanatics will not change this aspect of their morality. It’s genuinely not personal. They think that it would be highly unethical for any AI to let heretics go unpunished. They really do not want the fate of the world to be decided by an unethical AI. Any world where such an unethical entity has exerted such power is a dark world. And the LP outcome can be implemented even if the heretics are no longer around.
More generally: the problem, from the perspective of Steve, is that these two constraints do not actually grant Steve any meaningful influence regarding the adoption of those preferences that refer to Steve. I think that such influence is a necessary (but far from sufficient) feature for an AI to be better than extinction (in expectation, from the perspective of essentially any human individual). So my proposal would be to explore various ways of giving each individual meaningful influence regarding the adoption of those preferences that refer to her. One way of doing this would be to explore different ways of modifying PCEV, such that the Modified version of PCEV (MPCEV) does give each individual, in the set of individuals that MPCEV is pointed at, such influence. For example, along the lines of (some version of) the following rule:
If a preference is about Steve, then MPCEV will only take this preference into account if (i) the preference counts as concern for the well-being of Steve, or (ii) Steve would approve of MPCEV taking this preference into account.
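To make the shape of this rule concrete, it can be read as a filter applied to preferences before they are aggregated. The following is only a toy sketch in Python, not a proposed implementation. Everything in it is a hypothetical name introduced for illustration: `Preference`, `mpcev_filter`, and the two predicates `is_concern_for_wellbeing` and `would_approve`. The two predicates are stubs, and specifying them is exactly the hard, open part of the problem. The sketch also assumes that preferences which do not refer to any individual pass through unchanged, which the rule itself does not say.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Preference:
    holder: str            # who holds the preference
    about: Optional[str]   # the individual it refers to, or None
    description: str


def mpcev_filter(
    preferences: Iterable[Preference],
    is_concern_for_wellbeing: Callable[[Preference], bool],
    would_approve: Callable[[str, Preference], bool],
) -> List[Preference]:
    """Keep only the preferences that the rule allows to be counted."""
    kept = []
    for p in preferences:
        if p.about is None:
            # Assumption: the rule only restricts preferences about someone.
            kept.append(p)
        elif is_concern_for_wellbeing(p) or would_approve(p.about, p):
            # Clause (i) or clause (ii) of the rule holds.
            kept.append(p)
    return kept


# Toy data and stub predicates, purely illustrative.
prefs = [
    Preference("fanatic", "Steve", "Steve should be punished"),
    Preference("friend", "Steve", "Steve should be healthy"),
    Preference("fanatic", None, "more temples should be built"),
]

kept = mpcev_filter(
    prefs,
    is_concern_for_wellbeing=lambda p: p.description.endswith("healthy"),
    would_approve=lambda person, p: False,  # stub: Steve approves of nothing extra
)
```

In this toy run, the preference that Steve be punished is dropped, because it is neither concern for Steve's well-being nor something Steve would approve of counting. The other two preferences are kept.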
Even more generally, I think that it is important, and urgent, to make progress on what I call the “what alignment target should be aimed at?” question, and that you refer to as Goalcraft. (In addition to your past work on CEV variants, it was your Goalcraft post that made me DM you and point you to this post.) Exploring different ways of modifying PCEV sounds to me like a promising way towards meaningful progress on this question. I think that s-risk from successfully hitting a bad alignment target is a serious, and very underexplored, issue. I think that there are important differences between this type of s-risk and the type of AI risk that is associated with “aiming failures”. In particular, progress on the “what alignment target should be aimed at?” question can reduce the former type of s-risk (and this can be done even if one does not find an actual answer). One way of reducing this s-risk is to find problems with existing proposals. Another way is to describe general features that are necessary for safety (for example, along the lines of the “each individual must have meaningful influence over the adoption of those preferences that refer to her” feature mentioned above). A third way to reduce the s-risk that comes from successfully hitting the wrong alignment target is to show that the “what alignment target should be aimed at?” question is, genuinely, unintuitive.
One very positive thing, that happens to be true, is that the class of bad outcomes that I am trying to prevent would probably involve a very capable design team, one that is careful and clever enough to actually hit what they are aiming for. Explaining insights to such a design team sounds feasible (including meta insights, such as the fact that this question is, genuinely, unintuitive). In other words: once insights have been generated, it will probably be relatively easy to communicate these insights (at least compared to many other “AI is dangerous” related communication tasks). First, however, such insights must be generated. And this will probably require some dedicated effort. So the immediate task, as far as I can tell, is to create a community of people who are fully focused on exploring the “what alignment target should be aimed at?” question.