My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people who intrinsically value hurting other individuals. For Yudkowsky’s description of the issue, search the CEV Arbital page for “ADDED 2023”.
The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is, however, very neglected: there does not exist a single research project dedicated to it. In other words: the reason I am doing ATA is that it is a tractable and neglected way of reducing risks.
I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can, for example, PM me here, PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address; it’s a Gavagai / Word and Object joke from my grad student days).
My background is physics as an undergrad, followed by AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8 (none of this work is connected to any form of deep learning).
The proposals described in your points 1 and 2 are very different from any of the proposals that I am analysing in the post. I consider this to be a good thing. But I wanted to note explicitly that this discussion has now moved very far away from what was discussed in the post, and is best seen as a new discussion (a discussion that starts with the proposals described in your points 1 and 2). Making this clear is important, because it means that many points made in the post (and also earlier in this thread) do not apply to the class of proposals that we are now discussing.
In particular: all alignment targets analysed in the post are Group AIs. But the alignment target described in your point 1, Coherent Extrapolation of Equanimous Volition (CEEV), is not a Group AI. Given that the primary focus of the post is to analyse the Group AI idea, the analysis of CEEV below is best seen as starting a completely new discussion. Among other things, this means that many arguments from the post about Group AIs will probably not apply to CEEV. (CEEV is still very bad for individuals, because it is still the case that no individual has any meaningful influence over the way in which CEEV adopts those preferences that refer to her. One specific issue is that some CEEV delegates will still prefer outcomes where heretics are punished, because some delegates will still have an aversion to unethical AIs and will classify a non-punishing AI as unethical. The issue is described in detail in the last section of this comment.)
The rule for deciding which actions are available to delegates during negotiations, described in your point 2, is also a large departure from anything discussed in the post. The described rule would accept actions even when those actions make things dramatically worse for some people. I think that this makes it a very different kind of rule compared to Davidad’s proposed Pareto Improvement rule. The points that I made about Pareto Improvements in the post, and earlier in this thread, do not apply to this new class of rules. (The set of actions is still rendered empty by the rule described in your point 2, due to a large and varied set of hard constraints demanding that the AI must not be unethical. A single pair of such demands can render the set empty, if the two demands encode incompatible views regarding what it means for an AI to be unethical. Some pairs of demands like this have nothing to do with utility inversion. The issue is described in detail in the next section of this comment.)
It also makes sense to note explicitly that with the rule described in your point 2, you have now started down the path of removing entire classes of constraints from consideration (as opposed to the path of looking for new Pareto Baselines). Therefore, my earlier statement that the path you are exploring is unlikely to result in a non-empty set no longer applies: that statement was expressing doubt about finding a usable Pareto Baseline that would result in a non-empty set. In my view you are now doing something very different (and far more interesting) than looking for such a baseline.
I will spend most of this comment talking about the proposals described in your points 1 and 2. But let’s first try to wrap up the previous topics, starting with Bob2. Bob2 differs from Bob only in that Bob2 does not regard an AI that literally never acts as a person. I don’t see why Bob2’s way of looking at things would be strange or unusual. A thing that literally never acts can certainly be seen as a person, but it does not have to be seen as a person. Both perspectives seem reasonable. These two different classifications are baked into a core value, related to the Dark Future concept. (In other words: Bob and Bob2 have different values, so there is no reason to think that learning new facts would make them agree on this point, because there is no reason to think that learning new facts would change core values.) In a population of billions, there will thus be plenty of people who share Bob2’s way of looking at such an AI. So if the AI is pointed at billions of humans, the set of Pareto Improvements will be rendered empty by people like Bob2 (relative to the alternative no-AI-action Pareto Baseline that you discussed here).
Now let’s turn to your point about the size of the action space. Most of my previous points probably do not apply to rules that ignore entire classes of constraints (such as the “pathological constraints” that you mention); in that case everything depends on how one defines this class of constraints. Rules that do ignore classes of constraints are discussed in the next section of this comment. However: for rules that do not ignore any constraints, the number of actions is not necessarily relevant (in other words: while we are still talking about Pareto Improvements, the number of actions is not necessarily relevant). One can roughly describe the issue as: if one constraint demands X, and another constraint forbids X, then the set is empty, regardless of the number of actions.
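To make that point concrete, here is a minimal sketch. It is my own illustration, not code from the post: the field names and helper functions are made up for the example. Hard constraints are modelled as predicates over actions, and the admissible set is the subset of actions that satisfies every constraint; one incompatible pair empties that set no matter how many actions exist.

```python
# Minimal sketch (my own illustration, not code from the post; the field and
# helper names are made up). Hard constraints are predicates over actions,
# and the admissible set is the subset of actions that satisfies every
# constraint. One incompatible pair empties that set no matter how many
# actions exist.

def admissible(actions, constraints):
    """Return the actions that satisfy every hard constraint."""
    return [a for a in actions if all(c(a) for c in constraints)]

def demands_X(action):
    return action["does_X"]       # one person's hard constraint

def forbids_X(action):
    return not action["does_X"]   # another person's hard constraint

# A large action space: half the actions do X, half do not.
actions = [{"id": i, "does_X": bool(i % 2)} for i in range(1_000_000)]

print(len(admissible(actions, [demands_X])))             # 500000
print(len(admissible(actions, [demands_X, forbids_X])))  # 0: empty, regardless of size
```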
I’m not sure whether or not there is any significant disagreement left on this issue. But I will still elaborate a bit more on how I see the original situation (the situation where pathological constraints are not ignored).
One can say that everything is short-circuited by the fact that humans often have very strong opinions about who should be in charge. (And there are many different types of ontologies that are compatible with such sentiments, which means that we can expect great variety in what this implies regarding demands about the AI.) Wanting the right type of person to be in charge can be instrumental, but it does not have to be. And there is nothing unusual about demanding things that are entirely symbolic. (In other words: there is nothing unusual about Dennis, who demands that The Person in Charge must do or value things that have no connection to the material situation of Dennis.)
This is not part of every ontology. But caring about who is in charge is a common human value (at least common enough for a population of billions to include a great variety of hard constraints related to this general type of sentiment). The number of actions does not help if one person rejects all trajectories where the person in charge is X, and another person rejects any trajectory unless the person in charge is X (combined with the classification of any trajectory that contains a non-misaligned and clever AI that takes any action, as a trajectory where the first such AI is in charge). (I don’t know if we actually disagree on anything here. Perhaps you would classify all constraints along these lines as Pathological Constraints.) (In the next section I will point out that while such incompatible pairs can be related to utility inversion, they do not have to be.)
I will first discuss the proposal described in your point 2 in the next section, and then discuss the proposal described in your point 1 in the last section (because finding the set of actions that are available to delegates happens before delegates start negotiating).
The rule for determining which set of actions will be included in negotiations between delegates
The rule described in your point 2 still results in an empty set, for the same reason that Davidad’s original Pareto Improvement rule results in an empty set. The rule described in your point 2 still does not remove the problem of Bob from the original thought experiment of the post, because the thing that Bob objects to is an unethical AI. The issue is not that Bob wants to hurt Dave, or that Bob wants to believe that the AI is ethical (or that Bob might want to believe that Dave is punished, or that Bob might want to see Dave being punished). The issue is that Bob does not want the fate of humanity to be determined by an unethical AI.
Demanded punishments also do not have to refer to Dave’s preferences. It can be the case that Gregg demands that Dave’s preferences are inverted. But it can also be the case that Gregg demands that Dave be subjected to some specific treatment (and this can be a treatment that Dave will categorically reject). There is nothing unexpected about a fanatic demanding that heretics be subjected to a specific type of treatment. It is not feasible to eliminate all “Problematic Constraints” along these lines by eliminating some specific list of constraint types (for example: utility-inverting constraints, hostile constraints, or demands that people suffer). Combined with the fact that Dave still has no meaningful influence over those constraints that are about Dave, this means that there is still nothing preventing someone from demanding that things happen to Dave that Dave finds completely unacceptable. A single such constraint is sufficient to render the action space empty (regardless of the size of the action space).
When analysing this type of rule it might actually be best to switch to a new type of person, one that has not been part of my past thought experiments. Specifically: the issue with the rule described in your point 2 can also be illustrated using a thought experiment that does not involve any preferences that refer to any human in any way. The basic situation is that two people have incompatible demands regarding how an AI must interact with a specific sacred place or object, in order for the AI to be considered acceptable.
Let’s take ancient Egyptian religion as an example, in order to avoid contemporary politics. Consider Intef, who was named after the Pharaoh who founded Middle Kingdom Egypt, and Ahmose, who was named after the Pharaoh who founded New Kingdom Egypt. They both consider it a moral imperative to restore temples to their rightful state (if one has the power to do so). But they disagree on which period of Egyptian religion was the right one, and therefore disagree on what the AI must do to avoid being classified as unethical (in the sense of the Dark Future concept).
Specifically: a Middle Kingdom temple was destroyed and the stones were used to build a New Kingdom temple. Later that temple was also destroyed. Intef considers it a moral imperative to use the stones to rebuild the older temple (if one has the ability to do so), and Ahmose considers it a moral imperative to use the same stones to rebuild the newer temple (if one has the ability to do so). Neither of them thinks that an unethical AI is acceptable (once the AI is classified as unethical, the rest of the story follows the same path as the examples with Bob or Bob2). So the set would still be empty, even if a rule simply ignores every constraint that refers to any human in any way.
Neither of these demands is in any way hostile (or vicious, or based in hate, or associated with malevolent people, or belligerent, or anything else along such lines). Neither of these demands is on its own problematic or unreasonable. On its own, either demand is in fact trivial to satisfy (the vast majority of people would presumably be perfectly ok with either option). And neither demand looks dangerous (nor would either result in an advantage in Parliamentarian Negotiations). Very few people would watch the world burn rather than let Intef use the original stones to rebuild his preferred temple. But it only takes one person like Ahmose to make the set of actions empty.
Let’s go through another iteration and consider AI47, which uses a rule that ignores some additional constraints. When calculating whether or not an action can be used in delegate negotiations, AI47 ignores all preferences that (i) refer to AI47 (thus completely ignoring all demands that AI47 not be unethical), (ii) refer to any human, (iii) are dangerous, (iv) are based on hate, bitterness, spite, ego, and so on, or (v) make demands that are unreasonable or difficult to satisfy. Let’s say that in the baseline trajectory that alternative trajectories are compared to, AI47 never acts. If AI47 never acts, this would lead to someone eventually launching a misaligned AI that would destroy the temple stones (and also kill everyone).
Intef and Ahmose both think that if a misaligned AI destroys the stones, then this counts as the stones being destroyed in an accident (comparable, from a moral standpoint, to the case where the stones are destroyed by an unpreventable natural disaster). Conditioned on a trajectory where the stones are not used to restore the right temple, both prefer a trajectory where the stones are destroyed by accident (in addition to caring about the ethics of the AI that is in charge, they also care about the stones themselves). And there is no way for a non-misaligned, clever AI (like AI47) to destroy the stones by accident (in a sense that they would consider equivalent to an unpreventable natural disaster). So the set is still empty.
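Here is a toy encoding of this filtering step. It is my own construction for illustration: the tag names and dictionary fields are made up, and the baseline comparison is assumed away (surviving demands are treated as plain hard filters on actions). It only shows that dropping whole classes of tagged constraints still leaves the Intef/Ahmose pair, and that this pair alone empties the set.

```python
# Toy encoding of the filtering idea (my own construction; tag names and
# fields are made up, and the baseline comparison is assumed away).

IGNORED_CLASSES = {"refers_to_AI47", "refers_to_humans", "dangerous",
                   "hate_based", "unreasonable"}

def surviving(constraints):
    """Drop every constraint whose tags overlap the ignored classes."""
    return [c for c in constraints if not (c["tags"] & IGNORED_CLASSES)]

def admissible(actions, constraints):
    """Return the actions that satisfy every surviving hard constraint."""
    return [a for a in actions if all(c["test"](a) for c in constraints)]

actions = [{"stones_used_for": t}
           for t in ("middle_kingdom_temple", "new_kingdom_temple", "neither")]

constraints = [
    # "AI47 must not be unethical" style demands: dropped by the rule.
    {"tags": {"refers_to_AI47"}, "test": lambda a: False},
    # Intef's surviving demand about the stones themselves.
    {"tags": set(), "test": lambda a: a["stones_used_for"] == "middle_kingdom_temple"},
    # Ahmose's surviving demand about the same stones.
    {"tags": set(), "test": lambda a: a["stones_used_for"] == "new_kingdom_temple"},
]

print(len(admissible(actions, surviving(constraints))))  # 0: still empty
```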
In other words: even though this is no longer an attempt to find a usable Pareto Baseline that simultaneously satisfies many trillions of hard constraints, a single pair of constraints can still make the set empty. And it is still an attempt to deal with a large set of hard constraints, defined in a great variety of ontologies. It is also still true that (in addition to constraints coming from people like Intef and Bob2) this set will include constraints defined in many ontologies that we will not be able to foresee (including the ontologies of a great variety of non-neurotypical individuals who have been exposed to a great variety of value systems during childhood). This is an unusual feature of the AI context (compared to other contexts that deal with human preferences). A preference defined in an ontology that no one ever imagined might exist has no impact on debates about economic policy. But unless one simply states that a rule should ignore any preference that was not considered by the designers, the quest to find a rule that actually implies a non-empty set must deal with this highly unusual feature of the AI context.
(Intef and Ahmose pose a lot more problems in this step than they pose in the step where delegates are negotiating. In that later step, their delegates have no problematic advantage, and their delegates are not trying to implement anything worse than extinction. This is probably why this type of person has not been part of any of my past thought experiments. I have not thought deeply about people like Intef and Ahmose.)
(There exist several contemporary examples of this general type of disagreement over sacred locations or objects. Even the specific practice of reusing temple stones was common in many different times and places, though the ancient Egyptians are the undisputed champions of temple-stone reuse. And people nowadays don’t really have strong opinions regarding which version of ancient Egyptian religion is the right one, which is why I think it makes sense to use this example.)
(I’m happy to keep exploring this issue. I would not be surprised if this line of inquiry leads to some interesting insight)
(If you are looking for related literature, you might want to take a look at the Sen “Paradox”. Depending on how one defines “pathological preferences”, they may or may not be related to “nosy preferences”.)
(Technical note: this discussion makes a series of very optimistic assumptions in order to focus on problems that remain despite these assumptions. For example, it assumes away a large number of very severe definitional issues. Reasoning from such assumptions does not make sense if one is arguing that a given proposal would work. But it does make sense when one is showing that a given proposal fails even if one makes such optimistic assumptions. This point also applies to the next section.)
Coherent Extrapolation of Equanimous Volition (CEEV)
Summary: In the CEEV proposal described in your point 1, many different types of fanatics would still be represented by delegates that want outcomes where heretics are punished, for example fanatics who would see a non-punishing AI as unethical. This means that CEEV still suffers from the problem that was illustrated by the original PCEV thought experiment. In other words: having utility-inverting preferences is one possible reason to want an outcome where heretics are punished, and such preferences would not be present in CEEV delegates. But another reason to want an outcome where heretics are punished is a general aversion to unethical AIs, and removing utility-inverting preferences from CEEV delegates would not remove their aversion to unethical AIs. Yet another type of sentiment that would be passed on to CEEV delegates is the case where someone wants heretics to be subjected to some specific type of treatment (simply because, all else equal, it would be sort of nice if the universe ended up like this). There are many other types of sentiments along these lines that would also be passed on to CEEV delegates (including a great variety of sentiments that we have no hope of comprehensively cataloguing). Which means that many different types of CEEV delegates would still want an outcome where heretics are hurt. All of those delegates would still have a very dramatic advantage in CEEV negotiations.
Let’s start by noting that fanatics can gain a very dramatic advantage in delegate negotiations without being nearly as determined as Gregg or Bob. Unlike the situation discussed in the previous section, in delegate negotiations people just need to weakly prefer an outcome where heretics are subjected to some very unpleasant treatment. In other words: people can gain a very dramatic negotiation advantage simply because they feel that (all else equal) it would be sort of nice to have some type of outcome that for some reason involves bad things happening to heretics.
There exists a great variety of reasons why someone might have such sentiments. In other words: some types of fanatics might lose their negotiation advantage in CEEV, but many types of fanatics would retain their advantage (due to a great variety of preferences defined in a great variety of ontologies). Which in turn means that CEEV suffers from the same basic problem that PCEV suffers from.
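To give the flavour of this kind of negotiation advantage, here is a toy two-delegate bargaining sketch. Everything in it is my own simplified construction: the specific utilities, the Nash bargaining rule, and the random-dictator style disagreement point are assumptions made for illustration, not the actual PCEV or CEEV negotiation mechanism. It only shows the general shape of the effect: a delegate whose favourite outcome is terrible for the other party shifts the negotiated result, even though that delegate’s own preference for the outcome is no stronger than before.

```python
# Toy two-delegate bargaining sketch (my own simplified construction, not the
# actual PCEV or CEEV mechanism).

def nash_point(outcomes, disagreement, steps=200):
    """Grid-search lotteries over three outcomes; return the mixture that
    maximizes the product of both delegates' gains over the disagreement
    point (the Nash bargaining criterion)."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p = (i / steps, j / steps, 1 - (i + j) / steps)
            u = [sum(pk * o[k] for pk, o in zip(p, outcomes)) for k in range(2)]
            gains = (u[0] - disagreement[0], u[1] - disagreement[1])
            if min(gains) < 0:
                continue  # not individually rational for one delegate
            score = gains[0] * gains[1]
            if best is None or score > best[0]:
                best = (score, p)
    return best[1]

# Outcomes are (u_A, u_B) pairs: A's favourite, a compromise, B's favourite.
benign  = [(1.0,   0.0), (0.6, 0.6), (0.0, 1.0)]
harmful = [(1.0, -10.0), (0.6, 0.6), (0.0, 1.0)]  # A's favourite is now terrible for B

for name, outs in (("benign ", benign), ("harmful", harmful)):
    # Disagreement point: each delegate's favourite is implemented with prob 0.5.
    d = (0.5 * outs[0][0] + 0.5 * outs[2][0],
         0.5 * outs[0][1] + 0.5 * outs[2][1])
    p = nash_point(outs, d)
    print(name, "-> probability of A's favourite outcome:", round(p[0], 3))

# benign  -> 0.0   (the compromise wins outright)
# harmful -> 0.115 (the outcome that is terrible for B now gets real weight,
#                   even though A's preference for it is no stronger than before)
```

In this toy, the only difference between the two cases is how bad A’s favourite outcome is for B; A’s own utilities are unchanged, yet the negotiated lottery shifts toward that outcome.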
You mention the possibility that an AI might lie to a fanatic regarding what is happening. But a proposed outcome along such lines would change nothing. CEEV delegates representing fanatics with an aversion to unethical AIs would, for example, have no reason to accept such an outcome, because the preferences of the fanatics in question are not about their beliefs regarding unethical AIs. Their preferences are about unethical AIs.
In addition to fanatics with an aversion to unethical AIs, we can also look at George, who wants heretics to be punished as a direct preference (without any involvement of preferences related to unethical AIs). George might, for example, want all heretics to be subjected to some specific treatment (demands that heretics be subjected to some specific treatment are not unusual). No need for anything complicated or deeply felt: George might simply feel that it would be sort of nice if the universe were organised like this (all else equal).
George could also want the details of the treatment to be worked out by a clever AI (without referring to any form of utility inversion or suffering, or even referring in any way to any heretic when specifying the details of the treatment). George might, for example, want all heretics to be put in whatever situation would make George feel the greatest amount of regret. In other words: this type of demand does not have to be related to any form of utility inversion, and the details of the treatment that George would like heretics to be subjected to do not even need to be determined by any form of reference to any heretic. In yet other words: there are many ways for fanatics along the lines of George to gain a very large negotiation advantage in CEEV. (The proposal that CEEV might lie to George about what is happening to heretics would change nothing, because George’s preference is not about George’s beliefs.)
The type of scenario that you describe, where George might want to see Dave being hurt, is not actually an issue here. Let’s look more generally at George’s preferences regarding George’s experiences, George’s beliefs, George’s world model, and so on. None of those pose a problem in the original PCEV (because they do not result in a negotiation advantage for George’s delegate). (We might not have any actual disagreement regarding these types of preferences. I just wanted to be clear about this point.)
From the perspective of Steve, the underlying issue with CEEV is that Steve still has no meaningful control over the way in which CEEV adopts those preferences that refer to Steve. This in turn means that Steve still has no reason to think that CEEV will want to help Steve, as opposed to wanting to hurt Steve. This point would remain true even if one were to remove additional types of preferences from delegates.
Eliminating some specific list of preference types (for example: utility-inverting preferences, hostile preferences, preferences that people suffer, and so on) does not qualitatively change this situation, because eliminating such a list of preference types does not result in Steve gaining meaningful influence regarding the adoption of those preferences that refer to Steve. In the case of Parliamentarian Negotiations, this means that some delegates will still want to hurt Steve, for a great variety of reasons (for example due to sentiments along the lines of an aversion to unethical AIs, and also due to a long and varied list of other types of sentiments that we have no hope of exhaustively cataloguing).
In other words: all those delegates that (for reasons related to a great variety of sentiments) still want outcomes where people are subjected to horrific forms of treatment will still have a very large negotiation advantage in CEEV. And such delegates will also have a very large negotiation advantage in any other proposal without the SPADI feature that is based on the idea of eliminating some other specific list of preference types from delegates.
Since this discussion is exploring hypotheticals (as a way of reaching new insights), I’m happy to keep looking at proposals without the SPADI feature. But given the stakes, I do want to make a tangential point regarding plans that are supposed to end with a successfully implemented AI without the SPADI feature (presumably as the end point of some larger plan that includes things along the lines of an AI pause, augmented humans, an initial Limited AI, and so on).
In other words: I am happy to keep analysing proposals without the SPADI feature, because it is hard to predict what one will find when pulling on threads like this, and because analysing a dangerous proposal reduces the probability of it being implemented. But I also want to go on a tangent and explain why successfully implementing any AI without the SPADI feature would be extremely bad, and to note explicitly that this is true regardless of which specific path one takes to such an AI, and regardless of whether or not anyone manages to construct a specific thought experiment illustrating the exact way in which things go bad.
Let’s look at a hypothetical future proposal to illustrate these two points. Let’s say that someone proposes a plan that is supposed to eventually lead to the implementation of an AI that gets its preferences from billions of humans, and that this AI does not have the SPADI feature. Now let’s say that this proposed alignment target avoids the specific issues illustrated by all existing thought experiments. Let’s further say that no one is able to construct a specific thought experiment that illustrates exactly how this novel alignment target proposal would lead to a bad outcome. The absence of a thought experiment that illustrates the specific path to a bad outcome would not in any way, shape, or form imply that the resulting AI does not want to hurt Steve, if such a proposed plan is successfully implemented. In other words: since Steve will have no meaningful influence regarding the adoption of those preferences that refer to Steve, Steve will have no reason to expect the actual resulting AI to want to help Steve, as opposed to wanting to hurt Steve. PCEV implied a massively-worse-than-extinction outcome even before the specific problem was described (and PCEV spent a lot of years as a fairly popular proposal without anyone noticing the issue).
In yet other words: the actual AI that is actually implied by some proposed set of definitions can end up wanting to hurt Steve, regardless of whether or not someone is able to construct a thought experiment that illustrates the exact mechanism by which this AI would end up wanting to hurt Steve. Combined with the fact that Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, this means that Steve has no reason to expect this AI to want to help Steve, as opposed to wanting to hurt Steve.
In yet other words: the SPADI feature is far from sufficient for basic safety, but it really is necessary for basic safety. This in turn means that if a proposed AI does not have the SPADI feature, then this AI is known to be extremely bad for human individuals in expectation (if successfully implemented). This is true with or without a specific thought experiment illustrating the specific mechanism that would lead to this AI wanting to hurt individuals. And it is true regardless of what path was taken to the successful implementation of such an AI. (Just wanted to be explicit about these points. I’m happy to keep analysing proposals without the SPADI feature.)
(You might also want to take a look at this post.)