I do think that the outcome would be LP (more below), but I can illustrate the underlying problem using a set of alternative thought experiments that do not require agreement on LP vs MP.
Let’s first consider the case where half of the heretics are seen as Mild Heretics (MH) and the other half as Severe Heretics (SH). MH are those who are open to converting as part of a negotiated settlement (and SH are those who are not open to conversion). The Fanatics (F) would still prefer MP, where both MH and SH are hurt as much as possible. But F is willing to agree to a Negotiated Position (NP), where MH escape punishment in exchange for conversion, but where SH are hurt as much as possible, subject to a set of additional constraints. One such constraint would be a limit on what types of minds can be created and tortured as a way of hurting SH.
F prefers MP, and will vote for MP unless MH agree to vote for NP. Thus, agreeing to vote for NP is the only option available to MH that would remove the possibility of them personally being targeted by a powerful AI using all its creativity to think up clever ways of hurting them as much as possible. It would also be their only way of reliably protecting some class of hypothetical future individuals that they care about, and that would be created and hurt in MP. The negotiated position is therefore NP.
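To make this bargaining step concrete, here is a minimal toy sketch in Python. The utility numbers, the reference outcome SQ, the probability p that MP wins if no deal is reached, and the deal-versus-lottery framing are all illustrative assumptions on my part, not part of any PCEV specification; the sketch only shows why even a small chance of MP is enough to make MH accept NP.

```python
# A minimal toy sketch of the F / MH bargaining step described above.
# All numbers, the reference outcome SQ, and the "deal vs. no-deal lottery"
# framing are illustrative assumptions, not a description of how PCEV
# actually aggregates votes.

# Hypothetical utilities for the Fanatics (F) and Mild Heretics (MH) over:
#   MP: all heretics (both MH and SH) are hurt as much as possible
#   NP: MH convert and escape punishment; SH are hurt, under added constraints
#   SQ: an assumed reference outcome in which no punishment happens
u = {
    "F":  {"MP": 10.0, "NP": 7.0, "SQ": 0.0},
    "MH": {"MP": -1000.0, "NP": -1.0, "SQ": 0.0},
}

# Assumption: if MH refuse the deal, F vote for MP, which then wins with
# probability p (otherwise the outcome is SQ).
p = 0.3

def no_deal_value(bloc: str) -> float:
    """Expected utility of refusing the deal, under the assumed lottery."""
    return p * u[bloc]["MP"] + (1 - p) * u[bloc]["SQ"]

# MH accept NP whenever it beats their no-deal lottery; F offer NP whenever
# the certain NP beats their own no-deal lottery.
mh_accepts = u["MH"]["NP"] > no_deal_value("MH")   # -1 > -300  -> True
f_offers   = u["F"]["NP"]  > no_deal_value("F")    #  7 > 3     -> True
print("negotiated outcome:", "NP" if (mh_accepts and f_offers) else "no deal")
```

Under these assumptions, MH accept for essentially any positive value of p, because MP is catastrophic for them; the exact numbers matter very little.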
This variant of the thought experiment is perhaps better at illustrating the deeply alien nature of the arbitrarily defined abstract entity (given the label ``a Group″) that each individual would be subjected to, in case of the successful implementation of any AI that is describable as ``doing what a Group wants″ (the class of ``Group AI″ proposals includes all versions of CEV, as well as many other proposals). I think that this is far more dangerous than an uncaring AI. In other words: a ``Group AI″ has preferences that refer to you. But you have no meaningful influence regarding the adoption of those preferences that refer to you. That decision, just like every other decision, is entirely in the hands of an arbitrarily defined abstract entity (pointed at using an arbitrarily defined mapping that maps sets of billions of human individuals to the class of entities that can be said to want things).

My proposed way forward is to explore designs that give each individual meaningful influence regarding the adoption of those preferences that refer to her (doing so results in AI designs that are no longer describable as ``doing what a Group wants″). I say more about this in my response to your second comment to this post. But for the rest of this comment, I want to illustrate that the underlying issue does not actually depend on agreement with either of the two thought experiments discussed so far. Basically: I will argue that the conclusion that PCEV is deeply problematic does not depend on agreement on the details of these two thought experiments (in other words: I will outline an extended argument for the premise of your second comment).
First, it’s worth noting explicitly that the NP outcome is obviously not bad in any ``objective″ sense. If Bob likes the idea of sentient minds being tortured, then Bob will see NP as a good outcome. If Dave only cares about launching an AI as soon as possible (and is fully indifferent to what AI is launched), then Dave will simply not see either of these two thought experiments as relevant in any way. But I think that most readers will agree that NP is a bad outcome.
Let’s turn to another class of thought experiments that can be used to illustrate a less severe version of the same problem. Consider Steve, who wants everyone else to be punished. Steve is however willing to negotiate, and will agree to not vote for punishment if he gets some extra bonus that does not imply anyone else getting hurt (for example: above average influence regarding what should be done with distant galaxies, or an above average amount of resources to dispose of personally). The size of the bonus is now strongly sensitive to the severity of the punishment that Steve wants to inflict on others. The more hateful Steve is, the larger the bonus he gets. Yet again: this feature is not bad in any ``objective″ sense (Bob and Dave, mentioned above, wouldn’t see this as problematic in any way). But I hope that most readers will agree that building an AI that behaves like this is a bad idea.
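To illustrate how the bonus scales with hatefulness, here is another minimal sketch. The linear utilities and the probability p that Steve’s punishment vote would win absent a deal are purely illustrative assumptions.

```python
# A minimal illustrative calculation of the "hatefulness bonus" dynamic
# described above. The linear utilities and the probability p are
# assumptions made purely for illustration.

p = 0.05  # assumed probability that Steve's punishment vote wins if there is no deal

def minimal_bonus(hatefulness: float) -> float:
    """Smallest bonus that makes Steve prefer 'take the bonus, drop the
    punishment vote' over the no-deal lottery, assuming Steve values the
    punishment outcome at `hatefulness` and values the bonus linearly."""
    # Steve accepts the deal when  bonus >= p * hatefulness
    return p * hatefulness

for h in (10, 100, 1000):
    print(f"hatefulness {h:5} -> bonus needed {minimal_bonus(h):7.1f}")
# The concession the others must hand over grows in direct proportion to
# how much Steve wants them punished.
```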
We can also consider the animal rights activists Todd and Jeff. Both say that they strongly oppose the suffering of any sentient being. Todd actually does oppose all forms of suffering. Jeff is, sort of, telling the truth, but he is operating under the assumption that everyone would want to protect animals if they were just better informed. What Jeff actually wants is moral purity. He wants other people to behave correctly. And, even more importantly, Jeff wants other people to have the correct morality. When Jeff is faced with the reality that other people will not adopt the correct morality, even when they are informed about the details of factory farming and given time for reflection, then Jeff will decide that they deserve to be punished for their lack of moral purity. In a situation where Jeff is in a weak political position, and where he is still able to convince himself that most people are just misinformed, Jeff is not openly taking any totalitarian, or hateful, political positions. However, when Jeff finds out that most people, even when fully informed and given time to reflect, would still choose to eat meat (in a counterfactual situation where the AI is unable to provide meat without killing animals), then he wants them punished (as a question of moral principle: they deserve it, because they are bad people, and an AI that lets them avoid punishment is an unethical AI).

In a political conversation, it is essentially impossible to distinguish Todd from Jeff. So a reasonable debate rule is that you should conduct debates as if all of your opponents (and all of your allies) are like Todd. Accusing Todd of being like Jeff is unfair, since there is absolutely no way for Todd to ever prove that he is not like Jeff. It is also an accusation that can be levelled at most people for taking essentially any normative or political position. So, having informal debate rules stating that everyone should act as if all people involved in the conversation are like Todd often makes a lot of sense. It is however a mistake to simply assume that all people really are like Todd (or that people will remain like Todd even when they are informed of the fact that other people are not simply misinformed, and that value differences are, in fact, a real thing).

In particular, when we are considering the question of ``what alignment target should be aimed at?″, it is important to take into account the fact that PCEV would give a single person like Jeff far more power than a large number of people like Todd. Even if a given political movement is completely dominated by people like Todd, the influence within PCEV from the members of this movement would be dominated by people like Jeff. Even worse is the fact that the issues that would end up dominating any PCEV style negotiation are those issues that attract people along the lines of Jeff (in other words: what people think about the actual issue of animal rights would probably not have any significant impact on the actual outcome of PCEV. If some currently discussed question turned out to matter to the negotiations of extrapolated delegates, then it would probably be the type of issue that tends to interest the ``heretics deserve eternal torture in hell″ crowd). So, when asking the question of ``what alignment target should be aimed at?″, it is actually very important to take the existence of people like Jeff into account.
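To see why a single Jeff can outweigh many Todds, here is one more toy sketch, reusing the logic of the previous one. The ``extractable concession″ measure and all numbers are my own illustrative assumptions; PCEV defines no such quantity.

```python
# A toy comparison reusing the logic of the previous sketch: the concession
# a delegate can extract is proportional to the harm they are willing to
# threaten. The numbers are illustrative assumptions.

p = 0.05  # assumed probability that a single delegate's threatened vote would win

def extractable_concession(threatened_harm: float) -> float:
    # Same form as minimal_bonus above: others concede up to p * threatened harm.
    return p * threatened_harm

# A Todd-like delegate genuinely opposes all suffering, so the worst outcome
# he is willing to push for barely harms anyone: threatened harm is ~0.
todd_total = 1000 * extractable_concession(threatened_harm=0.0)

# A Jeff-like delegate is willing to push for severe punishment of everyone
# who fails his purity test: threatened harm is large.
jeff_total = 1 * extractable_concession(threatened_harm=10_000.0)

print(f"total concessions extractable by 1000 Todds: {todd_total}")  # 0.0
print(f"concessions extractable by a single Jeff:    {jeff_total}")  # 500.0
```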
(I use animal rights as a theme because the texts that introduce PCEV use this theme (and, as far as I can tell, PCEV is the current state of the art in terms of answering the ``what alignment target should be aimed at?″ question). The underlying dynamic is, however, not connected to this particular issue. In fact, sentiments along the lines of ``heretics deserve eternal torture in hell″ have historically not had particularly strong ties to animal rights (such sentiments have been common in many different times and places throughout human history, but they do not seem to be common amongst current animal rights movements). However, the animal rights issue does work to illustrate the point (also: sticking with this existing theme means that I don’t have to speculate out loud regarding which existing group of people are most like Jeff). Even though Jeff is a non-standard type of fanatic, it is still perfectly possible to use the power differential between Jeff and Todd in PCEV to illustrate the underlying problematic PCEV feature in question (since this feature of PCEV is not related to the specifics of the normative question under consideration, it is trivial to make the exact same point using essentially any normative question / theme).)
Regarding the validity of the thought experiment in the post:
If humans are mapped to utility functions such that LP is close to maximally bad, then the negotiated outcome would indeed not be LP. However, I don’t think that this would be a reasonable mapping, because I think that a clever enough AI would be capable of thinking up something that is far worse than LP (more below).
Regarding Pascal’s Mugging: this term is not usually used for these types of probabilities. If one in a hundred humans is a fanatic (or even one in a thousand), then I don’t think that it makes sense to describe this as Pascal’s Mugging. (For a set of individuals such that LP and MP are basically the same, the outcome would indeed not be LP. But I still don’t think that it would count as a variant of Pascal’s Mugging.) (Perhaps I should not have used the phrase ``tiny number of fanatics″. I did not mean ``tiny number″ in the ``negligible number″ sense; I was using it in the standard English sense.)
I do not think that LP and MP will be even remotely similar. This assessment does not rely on the number of created minds in LP (or the number of years involved). Basically: everything that happens in LP must be comprehensible to a heretic. That is not true for MP. And the comparison between LP and MP is made by an extrapolated delegate.
In MP, the fanatics would ask an AI to hurt the heretics as much as possible. So, for each individual heretic, the outcome in MP is designed by a very clever mind, specifically for the purpose of horrifying that heretic in particular (using an enormous amount of resources). The only constraint is that any mind created by PCEV must also be a heretic. In LP, the scenarios under consideration (that the 10^15 created minds will be subjected to) are limited to the set that the heretic in question is capable of comprehending. Even if LP and MP involved the same number of minds, and the same number of years, I would still expect LP to be the negotiated outcome. MP is still the result of a very clever AI using all of its creativity to think up some outcome specifically designed to horrify this particular heretic. Betting against LP as the negotiated outcome means betting against the ability of a very powerful mind to find a clever solution. In other words: I expect MP to be far worse than LP (and thus LP to be the negotiated outcome) for the same reason that I expect a clever AI1 to defeat an AI2 that is equally clever, but limited to strategies that a human is capable of comprehending, in a war, even if AI2 starts with a lot more tanks. (If the sticking point is the phrase ``the most horrific treatment, that this heretic is capable of comprehending″, then perhaps you will agree that the outcome would be LP if the wording is changed to ``the most horrific treatment, that this heretic is capable of coming up with, given time to think, but without help from the AI″.)