There is a serious issue with your proposed solution to problem 13. Using a random dictator policy as a negotiation baseline is not suitable for a situation where billions of humans are negotiating about the actions of a clever and powerful AI. One problem with using this solution in this context is that some people have strong commitments to moral imperatives along the lines of "heretics deserve eternal torture in hell". The combination of these types of sentiments and a powerful and clever AI (which would be very good at thinking up effective ways of hurting heretics) leads to serious problems when one uses this negotiation baseline. A tiny number of people with sentiments along these lines can completely dominate the outcome.
Consider a tiny number of fanatics with this type of morality. They consider everyone else to be heretics, and they would like the AI to hurt all heretics as much as possible. Since a powerful and clever AI would be very good at hurting a human individual, this tiny number of fanatics can completely dominate negotiations. People who would be hurt as much as possible (by a clever and powerful AI) in a scenario where one of the fanatics is selected as dictator can be forced to agree to very unpleasant negotiated positions under this negotiation baseline (agreeing to such an unpleasant outcome can be the only way to convince a group of fanatics not to ask the AI to hurt heretics as much as possible, in the event that a fanatic is selected as dictator).
This post explores these issues in the context of the most recently published version of CEV: Parliamentarian CEV (PCEV), which has a random dictator negotiation baseline. The post shows that PCEV results in an outcome massively worse than extinction (if PCEV is successfully implemented and pointed at billions of humans).
Another way to look at this is to note that the concept of "fair Pareto improvements" has counterintuitive implications when the question is about AI goals and some of the people involved have this type of morality. The concept was not designed with this aspect of morality in mind, and it was not designed to apply to negotiations about the actions of a clever and powerful AI. So it should not be very surprising to discover that the concept has counterintuitive implications when used in this novel context. If some change in the world improves the lives of heretics, then this change makes the world worse from the perspective of those people who would ask an AI to hurt all heretics as much as possible. For example: reducing the excruciating pain of a heretic, in a way that does not affect anyone else in any way, is not a "fair Pareto improvement" in this context. If every person is seen as a heretic by at least one group of fanatics, then the concept of "fair Pareto improvements" has some very counterintuitive implications when used in this context.
Yet another way of looking at this is to take the perspective of a human individual, Steve, who will have no special influence over an AI project. In the case of an AI that is describable as doing what a group wants, Steve has a serious problem (and this problem is present regardless of the details of the specific Group AI proposal). From Steve's perspective, the core problem is that an arbitrarily defined abstract entity will adopt preferences that are about Steve. But if this is any version of CEV (or any other Group AI) directed at a large group, then Steve has had no meaningful influence regarding the adoption of those preferences that refer to Steve. Just like every other decision, the decision of what Steve-preferences the AI will adopt is determined by the outcome of an arbitrarily defined mapping that maps large sets of human individuals into the space of entities that can be said to want things. Different sets of definitions lead to completely different such "Group entities". These entities all want completely different things (changing one detail can, for example, change which tiny group of fanatics will end up dominating the AI in question). Since the choice of entity is arbitrary, there is no way for an AI to figure out that the mapping "is wrong" (regardless of how smart this AI is). And since the AI is doing what the resulting entity wants, the AI has no reason to object when that entity wants the AI to hurt an individual. Since Steve does not have any meaningful influence regarding the adoption of those preferences that refer to Steve, there is no reason for him to think that such an AI will want to help him, as opposed to want to hurt him. Combined with the vulnerability of a human individual to a clever AI that tries to hurt that individual as much as possible, this means that any Group AI would be worse than extinction, in expectation.
Discovering that doing what a group wants is bad for human individuals in expectation should not be particularly surprising. Groups and individuals are completely different types of things. So this should be no more surprising than discovering that any reasonable way of extrapolating Dave will lead to the death of every single one of Dave's cells. Doing what one type of thing wants might be bad for a completely different type of thing. And aspects of human morality along the lines of "heretics deserve eternal torture in hell" show up throughout human history; they are found across cultures, religions, continents, and time periods. So if an AI project is aiming for an alignment target that is describable as "doing what a group wants", then there is really no reason for Steve to think that the result of a successful project would want to help him, as opposed to want to hurt him. And given the great ability of a clever AI to hurt a human individual, the success of such a project would be massively worse than extinction (in expectation).
The core problem, from the perspective of Steve, is that Steve has no control over the adoption of those preferences that refer to Steve. One can give each person influence over this decision without giving anyone any preferential treatment (see for example MPCEV in the post about PCEV, mentioned above). Giving each person such influence does not introduce contradictions, because this influence is defined in "AI preference adoption space", not in any form of outcome space. This can be formulated as an alignment target feature that is necessary, but not sufficient, for safety. Let's refer to this feature as the Self Preference Adoption Decision Influence (SPADI) feature. (MPCEV is basically what happens if one adds the SPADI feature to PCEV. Adding the SPADI feature to PCEV solves the issue illustrated by that thought experiment.)
The SPADI feature is obviously very underspecified. There will be lots of border cases whose classification will be arbitrary. But there still exist many cases where it is in fact clear that a given alignment target does not have the SPADI feature. Since the SPADI feature is necessary but not sufficient, these clear negatives are actually the most informative cases. In particular, if an AI project is aiming for an alignment target that clearly does not have the SPADI feature, then the success of this AI project would be worse than extinction, in expectation (from the perspective of a human individual who is not given any special influence over the AI project). While there are many border cases regarding which alignment targets could be described as having the SPADI feature, CEV is an example of a clear negative (in other words: there exists no reasonable set of definitions according to which there exists a version of CEV that has the SPADI feature). This is because building an AI that is describable as "doing what a group wants" is inherent in the core concept of building an AI that is describable as "implementing the Coherent Extrapolated Volition of Humanity".
In other words: alignment target analysis is essentially an open research question. This question is also (i): very unintuitive, (ii): very underexplored, and (iii): very dangerous to get wrong. If one focuses on necessary, but not sufficient, alignment target features, then it is possible to mitigate dangers related to someone successfully hitting a bad alignment target, even if one does not have any idea of what it would mean for an alignment target to be a good alignment target. This comment outlines a proposed research effort aimed at mitigating this type of risk.
These ideas also have implications for the Membrane concept, as discussed here and here.
(It is worth noting explicitly that the problem is not strongly connected to the specific aspect of human morality discussed in the present comment (the "heretics deserve eternal torture in hell" aspect). The problem is about the lack of meaningful influence regarding the adoption of self-referring preferences. In other words, it is about the lack of the SPADI feature. It just happens to be the case that this particular aspect of human morality is both (i): ubiquitous throughout human history, and (ii): well suited for constructing thought experiments that illustrate the dangers of alignment target proposals that lack the SPADI feature. If this aspect of human morality disappeared tomorrow, the basic situation would not change (the illustrative thought experiments would change, but the underlying problem would remain, and the SPADI feature would still be necessary for safety).)
The “random dictator” baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for “Pareto improvement” being “no superintelligence”). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people treat hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.
In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the table. The existence of such an AI would be seen as intrinsically bad by people who see hurting heretics as a moral imperative (for example: Gregg really does not want a world where Gregg has agreed to tolerate the existence of an unethical AI that disregards its moral duty to punish heretics). More generally: anything that improves the lives of heretics is off the table. If an outcome improves the lives of heretics (compared to the no-AI baseline), then this outcome is not a Pareto improvement, because improving the lives of heretics makes things worse from the point of view of those who are deeply committed to hurting heretics.
In yet other words: it only takes two individuals to rule out any outcome that contains any improvement for any person. Gregg and Jeff are both deeply committed to hurting heretics, but their definitions of "heretic" differ, and every individual is seen as a heretic by at least one of them. So any outcome that makes life better for any person is off the table. Gregg and Jeff do have to be very committed to the moral position that the existence of any AI that neglects its duty to punish heretics is unacceptable. It must, for example, be impossible to get them to agree to tolerate the existence of such an AI in exchange for increased influence over the far future. But a population of billions only has to contain two such people for the set of Pareto improvements to be empty.
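As a minimal sketch of why the set becomes empty, consider a toy model in which outcomes are described by each person's welfare change relative to the "no superintelligence" baseline, and in which a fanatic is made worse off whenever any of "their" heretics ends up better off than in that baseline. The four-person population, the heretic sets, and the utility functions below are illustrative assumptions of mine, not part of the original argument:

```python
# Toy sketch (illustrative assumptions only): a four-person population
# where Gregg's and Jeff's "heretic" sets together cover everyone.
# Welfare changes are measured relative to the "no superintelligence"
# baseline, which is zero change for every person.

GREGG_HERETICS = {"Alice", "Bob"}    # heretics according to Gregg
JEFF_HERETICS = {"Carol", "Dave"}    # heretics according to Jeff

def fanatic_utility(welfare_changes, heretics):
    # A fanatic is made worse off whenever any of "their" heretics
    # ends up better off than in the no-superintelligence baseline.
    return -sum(delta for person, delta in welfare_changes.items()
                if person in heretics and delta > 0)

def is_pareto_improvement(welfare_changes):
    # Pareto improvement over the baseline: no party is worse off,
    # and at least one party is strictly better off.
    utilities = list(welfare_changes.values()) + [
        fanatic_utility(welfare_changes, GREGG_HERETICS),
        fanatic_utility(welfare_changes, JEFF_HERETICS),
    ]
    return all(u >= 0 for u in utilities) and any(u > 0 for u in utilities)

# Any outcome that improves any person's life fails, because that
# person is a heretic to Gregg or to Jeff:
print(is_pareto_improvement({"Alice": 1, "Bob": 0, "Carol": 0, "Dave": 0}))  # False
print(is_pareto_improvement({"Alice": 0, "Bob": 0, "Carol": 3, "Dave": 2}))  # False
# And the unchanged baseline itself is not an improvement for anyone:
print(is_pareto_improvement({"Alice": 0, "Bob": 0, "Carol": 0, "Dave": 0}))  # False
```

In this toy model, the only candidate that does not make Gregg or Jeff worse off is the unchanged baseline, which is not an improvement for anyone, so the set of Pareto improvements is empty.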
Another way to approach this would be to ask: what would have happened if someone had successfully implemented a Gatekeeper AI built on top of a set of definitions such that the set of Pareto improvements is empty?
For the version of the random dictator negotiation baseline that you describe, this comment might actually be more relevant than the PCEV thought experiment. It is a comment on the suggestion by Andrew Critch that it might be possible to view a Boundaries / Membranes based BATNA as having been agreed to acausally. It is impossible to reach such an acausal agreement when a group includes people like Gregg and Jeff, for the same reason that it is impossible to find an outcome that is a Pareto improvement when a group includes people like Gregg and Jeff. (That comment also discusses ideas for how one might deal with the dangers that arise when one combines people like Gregg and Jeff with a powerful and clever AI.)
Another way to look at this would be to consider what it would mean to find a Pareto improvement with respect to only Bob and Dave. Bob wants to hurt heretics, and Bob considers half of all people to be heretics. Dave is an altruist who just wants people to have as good a life as possible. The set of Pareto improvements would now be made up entirely of variations on the same general situation: make the lives of non-heretics much better, and make the lives of heretics much worse. For Bob to agree, heretics must be punished. And for Dave to agree, Dave must see the average life quality as an improvement on the "no superintelligence" outcome. If the "no superintelligence" outcome is bad for everyone, then the lives of heretics in this scenario could get very bad.
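To make the Bob-and-Dave case concrete, here is a minimal sketch in the same spirit as the toy model above. Bob's and Dave's preferences are modeled by simple utility functions over two numbers: the welfare change of non-heretics and the welfare change of heretics, both relative to the "no superintelligence" baseline. The specific functions and numbers are assumptions for illustration only:

```python
# Toy sketch (illustrative assumptions): Pareto improvements with respect
# to only Bob and Dave, over candidate outcomes described by the welfare
# change of non-heretics and of heretics (relative to "no superintelligence").

def bob_utility(non_heretic_delta, heretic_delta):
    # Bob wants heretics punished: he is better off the worse heretics fare.
    return -heretic_delta

def dave_utility(non_heretic_delta, heretic_delta):
    # Dave cares about average welfare; half of all people are heretics
    # according to Bob, so the two groups are weighted equally.
    return 0.5 * non_heretic_delta + 0.5 * heretic_delta

def is_pareto_improvement(non_heretic_delta, heretic_delta):
    u = [bob_utility(non_heretic_delta, heretic_delta),
         dave_utility(non_heretic_delta, heretic_delta)]
    return all(x >= 0 for x in u) and any(x > 0 for x in u)

candidates = [
    ( 5,  5),   # everyone better off: blocked by Bob
    ( 0,  0),   # status quo: not an improvement for anyone
    ( 8, -2),   # non-heretics helped, heretics hurt: accepted
    ( 1, -9),   # heretics hurt far more than non-heretics are helped: blocked by Dave
]
for outcome in candidates:
    print(outcome, is_pareto_improvement(*outcome))
```

The only accepted candidates are the ones where heretics are made worse off while average welfare still goes up, which is the "punish heretics, help non-heretics" family of outcomes described above.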
More generally: people like Bob (with aspects of morality along the lines of "heretics deserve eternal torture in hell") will have dramatically increased power over the far future when one uses this type of negotiation baseline (assuming that things have been patched in a way that results in a non-empty set of Pareto improvements). If everyone is included in the calculation of what counts as a Pareto improvement, then the set of Pareto improvements is empty (due to people like Gregg and Jeff). And if not everyone is included, then the outcome could get very bad for many people (compared to whatever would have happened otherwise).
(Adding the SPADI feature to your proposal would remove these issues, and would prevent people like Dave from being disempowered relative to people like Bob. The details are importantly different from PCEV, but it is no coincidence that adding the SPADI feature removes this particular problem for both proposals. The common denominator is that, from the perspective of Steve, it is in general dangerous to encounter an AI that has taken "unwelcome" or "hostile" preferences about Steve into account.)
Also: my general point about the concept of "fair Pareto improvements" having counterintuitive implications in this novel context still applies (it is not related to the details of any specific proposal).