This seems to only be a problem if the individual advocates have vastly more optimization power than the AIs that check for non-aggression. I don’t think there’s any reason for that to be the case.
In contemporary society we generally have the opposite problem (the state uses lawfare against individuals).
My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement-related issue). The thought experiment showed that the rules described in the linked text do not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).
If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).
In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism that would actually protect Steve from a clever AI trying to hurt Steve.
There is a “Misunderstands position?” react to the following text:
The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules …
In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:
The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules …
I hope that this clarifies things.
The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly findable security holes would also not help (it would prevent the publication of further thought experiments, but it would not protect anyone from a clever AI trying to hurt her; and in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human-constructed box (analogous to: an AI hurting Steve), then one should avoid creating a clever AI that wants to leave the box (analogous to: avoid creating a clever AI that wants to hurt Steve). In other words: Steve’s real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.
(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions, because such influence would be defined in preference adoption space, not in any form of action or outcome space. In The ELYSIUM Proposal, however, no individual would have any influence whatsoever over the process by which billions of clever AIs would adopt preferences that refer to her.)
But the text that you link to does not suggest any mechanism, that would actually protect Steve
There is a baseline set of rules that exists for exactly this purpose, which I didn’t want to go into detail on in that piece because it’s extremely distracting from the main point. These rules are not necessarily made purely by humans, but could for example be the result of some kind of AI-assisted negotiation that happens at ELYSIUM Setup.
“There would also be certain baseline rules like “no unwanted torture, even if the torturer enjoys it”, and rules to prevent the use of personal utopias as weapons.”
But I think you’re correct that the system that implements anti-weaponization and the systems that implement extrapolated volitions are potentially pushing against each other. This is of course a tension that is present in human society as well, which is why we have police.
So basically the question is “how do you balance the power of generalized-police against the power of generalized-self-interest.”
Now the whole point of having “Separate Individualized Utopias” is to reduce the need for police. In the real world, it does seem to be the case that extremely geographically isolated people don’t need much in the way of police involvement. Most human conflicts are conflicts of proximity, crimes of opportunity, etc. It is rare that someone basically starts an intercontinental stalking vendetta against another person. And if you had the entire resources of police departments just dedicated to preventing that kind of crime, and they also had mind-reading tech for everyone, I don’t think it would be a problem.
I think the more likely problem is that people will want to start haggling over what kind of universal rights they have over other people’s utopias. Again, we see this in real life. E.g. “diverse” characters forced into every video game because a few people with a lot of leverage want to affect the entire universe.
So right now I don’t have a fully satisfactory answer to how to fix this. It’s clear to me that most human conflict can be transformed into a much easier negotiation over basically who gets how much money/general-purpose-resources. But the remaining parts could get messy.
One principled way to do it would be simulated war on narrow issues.
So if actor A spends resources R_c on computation C, any other actor B can surrender resources equal to R_c to prevent computation C from happening. The surrendered resources and the original resources are then physically destroyed (e.g. spent on Bitcoin mining or something).
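To make the mechanics concrete, here is a minimal sketch of this matching rule, assuming a simple representation of balances (the function name and types are my own illustrative choices, not part of the proposal):

```python
# Minimal sketch of the matching rule described above. Balances and names are
# illustrative assumptions, not part of the proposal.

def contest_computation(spender_balance: float,
                        challenger_balance: float,
                        cost: float) -> tuple[float, float, bool]:
    """Actor A has spent `cost` on a computation; actor B may burn an equal
    amount to stop it. Returns (new_A_balance, new_B_balance, computation_runs)."""
    if challenger_balance < cost:
        # B cannot match the spend, so the computation goes ahead.
        return spender_balance - cost, challenger_balance, True
    # Both sides lose `cost`; nothing is transferred, the resources are simply
    # destroyed, which is what makes the interaction negative-sum.
    return spender_balance - cost, challenger_balance - cost, False
```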
This then at least means that to a first approximation, no actor has an incentive to destroy ELYSIUM itself in order to stop some computation inside it from happening, because they could just use their resources to stop the computation in the simulation instead. And many actors benefit from ELYSIUM, so there’s a large incentive to protect it.
And since the interaction is negative sum (both parties lose resources from their personal utopias) there would be strong reasons to negotiate.
In addition to this there could be rule-based and AI-based protections to prevent unauthorized funny tricks with simulations. One rule could be a sort of “cosmic block” where you can just block some or all other Utopias from knowing about you outside of a specified set of tests (“is torture happening here”, etc).
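A rough sketch of how such a block-plus-tests rule might look, assuming a whitelist of permitted audit queries (the query names and interface are invented for illustration, not specified by the proposal):

```python
# Hypothetical sketch of a "cosmic block": a Utopia is opaque to outsiders
# except for a fixed whitelist of audit tests. The query names are invented.
ALLOWED_QUERIES = {"is_torture_happening", "total_resource_spend"}

def query_utopia(utopia_state: dict, query: str):
    """Outsiders may only run whitelisted tests against a blocked Utopia."""
    if query not in ALLOWED_QUERIES:
        raise PermissionError("query blocked by cosmic block")
    return utopia_state.get(query)
```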
Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that want to hurt people: the Advocates of a great variety of humans, who want to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).
Regarding AI-assisted negotiations

I don’t think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non-trivial. Reasonable-sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people that intrinsically value hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans.)

The proposal to have a simulated war that destroys resources
10 people without any large resource needs could use this mechanism to kill 9 people they don’t like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefer death to compliance. Doing this costs them basically nothing.
This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else, and then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.
In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non-intrusive ground rules).
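To make the arithmetic of the majority scenario explicit, here is a worked example under the matching rule described earlier (the specific figures are illustrative assumptions):

```python
# Illustrative numbers for the majority scenario above, assuming the
# one-for-one matching rule: cancelling a computation costs the challenger as
# much as the victim spent on it.
total = 100.0
majority, minority = 55.0, 45.0   # resource split
majority_needs = 5.0              # assumed: what the majority actually wants to keep

# Cancelling everything the minority computes with its 45 costs the majority
# a matching 45, so 90 units are destroyed in total.
burned = 2 * minority
left_for_majority = majority - minority

assert burned / total == 0.9                 # 90 percent of all resources in ELYSIUM
assert left_for_majority >= majority_needs   # the majority gives up nothing it values
```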
The Cosmic Block and a specific set of tests
This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the “is torture happening here” test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resource-destruction mechanism, Steve will not be able to do so, because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve’s worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that could pay resources to stop this would not be able to do so.
So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.
I think that it would be better to just not create such a set of AIs
Let’s reason from the assumption that Bob’s Personal Advocate (BPA) is a clever AI that will be creating Bob’s Personalised Utopia. Let’s now again take the perspective of ordinary human individual Steve, who gets no special treatment. I think the main question that determines Steve’s safety in this scenario is how BPA adopts Steve-referring preferences. I think this is far more important for Steve’s safety than the question of what set of rules will govern Bob’s Personalised Utopia. The question of what BPA wants to do to Steve seems to me to be far more important for Steve’s safety than the question of what set of rules will constrain the actions of BPA.
Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone’s values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.
It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near-infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don’t think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that want to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.
Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
Influence over preferences of a single entity is much more conflict-y. The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.
The question of what BPA wants to do to Steve seems to me to be far more important for Steve’s safety than the question of what set of rules will constrain the actions of BPA.
BPA shouldn’t be allowed to want anything for Steve. There shouldn’t be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can’t even know that Steve exists.
I think the difficult part is when BPA looks at Bob’s preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the “not-torture” bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them.
And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And ‘being forced to participate in a status hierarchy that you are not at the top of’ is a type of ‘generalized consensual harm’.
Even the good old Reedspacer’s Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.
So I don’t think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity’s existence is actually miserable.
If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.
catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer
And, the problem with saying “OK let’s just ban the creation of catgirls” is that then maybe Reedspacer builds a volcano lair just for himself and plays video games in it, and the catgirls whose existence you prevented are going to scream bloody murder because you took away from them a very good existence that they would have enjoyed, and also made Reedspacer sad.
Steve will never become aware of what Bob is doing to OldSteve
But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?
I suppose perhaps Bob could create OldEve. Eve is at a similar but not identical point in personality space to Steve, and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob’s Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn’t trigger your torture detection test.
This kind of ‘marginal case of consensual torture’ has popped up in other similar discussions. E.g. In Yvain’s (Scott Alexander’s) article on Archipelago there’s this section:
“A child who is abused may be too young to know that escape is an option, or may be brainwashed into thinking they are evil, or guilted into believing they are betraying their families to opt out. And although there is no perfect, elegant solution here, the practical solution is that UniGov enforces some pretty strict laws on child-rearing, and every child, no matter what other education they receive, also has to receive a class taught by a UniGov representative in which they learn about the other communities in the Archipelago, receive a basic non-brainwashed view of the world, and are given directions to their nearest UniGov representative who they can give their opt-out request to.”
So Scott Alexander’s solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.
I think the issue here is that “people who unwisely consent to torture even after being told about it” and “people who are willing and consenting submissives” is not actually a hard boundary.
I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).
The first section argues that (unless Bob’s basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).
The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that have already been subjected to these memory wipes, so Advocates cannot be involved in negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.
The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.
The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).
The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this either to hurt people directly, or to issue genuinely credible threats.
BPA will still want to find and hurt heretics
It would not be surprising if a comprehensive, human-defined, memory deletion operation would completely rewrite someone’s basic moral framework. But I’m guessing that your proposed memory deletion is not supposed to be done in a way that changes a person’s basic moral framework. So let’s reason from the assumption that it does not.
This means that Bob still considers BPA to be bound by a non-negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.
In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human-constructed practical barrier that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.
BPA will likely be able to find Steve
Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI-assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory-wiped people (otherwise the whole point of the Cosmic Block is lost). So the Advocates cannot be involved in defining the memory wipe rules.
In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.
This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.
The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.
People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates. Whether or not a given person is accepted, would presumably depend on detailed personal information.
Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people’s Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.
Steve would also have left a lot of effects on the world, besides effects on people’s memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.
Steve might also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of are deleted, Steve’s life would have left behind many visible effects.
The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).
There is still nothing protecting Steve from BPA
If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.
I do not see any strong connections between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information should be provided by a government run by humans, to children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone have a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target.)
Memory wipes would complicate extrapolation
All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person that served as a role model would be significant.
Thoroughly deleting your memory of a person would also impact your memory of every conversation that you have ever had about this person, including conversations with people that are not deleted. Most long-term social relationships involve a lot of discussion of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc.). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)
In other words: even memories of non-deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (that do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, and why their interactions have been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose), or their Advocates must try to extrapolate them while lacking a lot of information.
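A small sketch of why the deletion cascades in this way, assuming a simple record of who participated in, and who was discussed in, each memory (the data layout and names are invented for illustration):

```python
# Illustrative sketch of the cascade: wiping one person removes not only direct
# memories of them, but every memory of a conversation in which they were
# discussed, including conversations between people who are not deleted.

def memories_to_wipe(memories: list[dict], deleted_person: str) -> list[dict]:
    """memories: records like {"participants": {...}, "discussed": {...}}."""
    return [m for m in memories
            if deleted_person in m["participants"]   # direct interactions
            or deleted_person in m["discussed"]]     # talk *about* the person

# Example: a memory of two non-deleted friends arguing about "Steve" is wiped,
# which is why even relationships between non-deleted people end up altered.
example = [{"participants": {"Alice", "Carol"}, "discussed": {"Steve"}}]
assert memories_to_wipe(example, "Steve") == example
```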
Getting the definitions involved in extrapolation right seems like it will be very difficult even under ordinary circumstances. Wide-ranging and very thorough memory deletion would presumably make extrapolation even trickier. This is a major issue.
Your proposed resource destruction mechanism
No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves. This only takes a fraction of available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people, or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because: a person giving in > burning resources to hurt someone who refuses > leaving someone that refuses alone. They are perfectly ok with an outcome where resources are spent on hurting someone that refuses to comply (they are not self-modifying as a negotiation strategy; they just think that this is a perfectly ok outcome).
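A minimal way to write down the preference ordering described above (the utility numbers are arbitrary placeholders; only their ranking is assumed):

```python
# The threatener's ranking over outcomes, as described above. Numbers are
# placeholders; only the ordering is assumed.
UTILITY = {
    "target_complies": 3,
    "burn_resources_to_punish_refusal": 2,
    "leave_refuser_alone": 1,
}

# Carrying out the threat requires no strategic self-modification: if the
# target refuses, punishing is already the threatener's preferred remaining
# option, which is exactly what makes the threat credible.
best_after_refusal = max(["burn_resources_to_punish_refusal", "leave_refuser_alone"],
                         key=UTILITY.get)
assert best_after_refusal == "burn_resources_to_punish_refusal"
```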
Preventing this type of threat would be difficult because (i) negotiations are allowed, and (ii) in any scenario where threats are prevented, the threatened action would simply be taken (for non-strategic reasons). There is no difference in behaviour between scenarios where threats are prevented and scenarios where threats are ignored.
The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.
Let’s call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.
First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don’t need). So there is no reason to think that an extrapolated post-AI majority would want to try to prevent the creation of a version of ELYSIUM with an MPR. They would prefer the case without an MPR. This does not imply that they care enough to try to prevent the creation of a version of ELYSIUM with an MPR. Doing so would presumably be very risky, and they don’t gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.
More importantly however: you are conflating (i) a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii) a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about humans opposing / supporting an AI that would be good / bad for them (from the perspective of their extrapolated Advocates). That is the whole point of having extrapolated Advocates.
a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else, and then using the remaining resources to hurt the minority.
If there is an agent that controls 55% of the resources in the universe and is prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.
And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn’t happen.
Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefer death to compliance. Doing this costs them basically nothing.
This assumes that threats are allowed. If you allow threats within your system, you are losing out on most of the value of trying to create an artificial utopia, because you will recreate most of the bad dynamics of real history, which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial.
Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.
In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal at least within most countries and is punished by a loss of resources far greater than the threat could win.
People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).
The ability to have civil disobedience without protection rackets does seem kind of crucial.
This seems to only be a problem if the individual advocates have vastly more optimization power than the AIs that check for non-aggression. I don’t think there’s any reason for that to be the case.
In contemporary society we generally have the opposite problem (the state uses lawfare against individuals).
My thought experiment assumed that all rules and constraints described in the text that you linked to had been successfully implemented. Perfect enforcement was assumed. This means that there is no need to get into issues such as relative optimization power (or any other enforcement related issue). The thought experiment showed that the rules described in the linked text does not actually protect Steve from a clever AI that is trying to hurt Steve (even if these rules are successfully implemented / perfectly enforced).
If we were reasoning from the assumption that some AI will try to prevent All Bad Things, then relative power issues might have been relevant. But there is nothing in the linked text that suggests that such an AI would be present (and it contains no proposal for how one might arrive at some set of definitions that would imply such an AI).
In other words: there would be many clever AIs trying to hurt people (the Advocates of various individual humans). But the text that you link to does not suggest any mechanism, that would actually protect Steve from a clever AI trying to hurt Steve.
There is a ``Misunderstands position?″ react to the following text:
The scenario where a clever AI wants to hurt a human that is only protected by a set of human constructed rules …
In The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt individual humans (the Advocates of various individual humans). So I assume that the issue is with the protection part of this sentence. The thought experiment outlined in my comment assumes perfect enforcement (and my post that this sentence is referring to also assumes perfect enforcement). It would have been redundant, but I could have instead written:
The scenario where a clever AI wants to hurt a human that is only protected by a set of perfectly enforced human constructed rules …
I hope that this clarifies things.
The specific security hole illustrated by the thought experiment can of course be patched. But this would not help. Patching all humanly findable security holes would also not help (it would prevent the publication of further thought experiments. But it would not protect anyone from a clever AI trying to hurt her. And in The ELYSIUM Proposal, there would in fact be many clever AIs trying to hurt people). The analogy with an AI in a box is apt here. If it is important that an AI does not leave a human constructed box (analogous to: an AI hurting Steve). Then one should avoid creating a clever AI that wants to leave the box (analogous to: avoid creating a clever AI that wants to hurt Steve). In other words: Steve’s real problem is that a clever AI is adopting preferences that refer to Steve, using a process that Steve has no influence over.
(Giving each individual influence over the adoption of those preferences that refer to her would not introduce contradictions. Because such influence would be defined in preference adoption space. Not in any form of action or outcome space. In The ELYSIUM Proposal however, no individual would have any influence whatsoever, over the process by which billions of clever AIs, would adopt preferences, that refer to her)
There is a baseline set of rules that exists for exactly this purpose, which I didn’t want to go into detail on in that piece because it’s extremely distracting from the main point. These rules are not necessarily made purely by humans, but could for example be the result of some kind of AI-assisted negotiation that happens at ELYSIUM Setup.
But I think you’re correct that the system that implements anti-weaponization and the systems that implement extrapolated volitions are potentially pushing against each other. This is of course a tension that is present in human society as well, which is why we have police.
So basically the question is “how do you balance the power of generalized-police against the power of generalized-self-interest.”
Now the whole point of having “Separate Individualized Utopias” is to reduce the need for police. In the real world, it does seem to be the case that extremely geographically isolated people don’t need much in the way of police involvement. Most human conflicts are conflicts of proximity, crimes of opportunity, etc. It is rare that someone basically starts an intercontinental stalking vendetta against another person. And if you had the entire resources of police departments just dedicated to preventing that kind of crime, and they also had mind-reading tech for everyone, I don’t think it would be a problem.
I think the more likely problem is that people will want to start haggling over what kind of universal rights they have over other people’s utopias. Again, we see this in real life. E.g. “diverse” characters forced into every video game because a few people with a lot of leverage want to affect the entire universe.
So right now I don’t have a fully satisfactory answer to how to fix this. It’s clear to me that most human conflict can be transformed into a much easier negotiation over basically who gets how much money/general-purpose-resources. But the remaining parts could get messy.
One principled way to do it would be simulated war on narrow issues.
So if actor A spends resources R_c on computation C, any other actor B can surrender resources equal to R_c to prevent computation C from happening. The surrendered resources and the original resources are then physically destroyed (e.g. spent on Bitcoin mining or something).
This then at least means that to a first approximation, no actor has an incentive to destroy ELYSIUM itself in order to stop some computation inside it from happening, because they could just use their resources to stop the computation in the simulation instead. And many actors benefit from ELYSIUM, so there’s a large incentive to protect it.
And since the interaction is negative sum (both parties lose resources from their personal utopias) there would be strong reasons to negotiate.
In addition to this there could be rule-based and AI-based protections to prevent unauthorized funny tricks with simulations. One rule could be a sort of “cosmic block” where you can just block some or all other Utopias from knowing about you outside of a specified set of tests (“is torture happening here”, etc).
Implementing The ELYSIUM Proposal would lead to the creation of a very large, and very diverse, set of clever AIs that wants to hurt people: the Advocates of a great variety of humans, that wants to hurt others in a wide variety of ways, for a wide variety of reasons. Protecting billions of people from this set of clever AIs would be difficult. As far as I can tell, nothing that you have mentioned so far would provide any meaningful amount of protection from a set of clever AIs like this (details below). I think that it would be better to just not create such a set of AIs in the first place (details below).
Regarding AI assisted negotiations
I don’t think that it is easy to find a negotiation baseline for AI-assisted negotiations that results in a negotiated settlement that actually deals with such a set of AIs. Negotiation baselines are non trivial. Reasonable sounding negotiation baselines can have counterintuitive implications. They can imply power imbalance issues that are not immediately obvious. For example: the random dictator negotiation baseline in PCEV gives a strong negotiation advantage to people that intrinsically values hurting other humans. This went unnoticed for a long time. (It has been suggested that it might be possible to find a negotiation baseline (a BATNA) that can be viewed as having been acausally agreed upon by everyone. However, it turns out that this is not actually possible for a group of billions of humans).
The proposal to have a simulated war that destroys resources
10 people without any large resource needs could use this mechanism to kill 9 people they don’t like at basically no cost (defining C as any computation done within the Utopia of the person they want to kill). Consider 10 people that just want to live a long life, and that do not have any particular use for most of the resources they have available. They can destroy all computational resources of 9 people without giving up anything that they care about. This also means that they can make credible threats. Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefers death to compliance. Doing this costs them basically nothing.
This mechanism does not rule out scenarios where a lot of people would strongly prefer to destroy ELYSIUM. A trivial example would be a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority. In this scenario almost half of all people would very strongly prefer to destroy ELYSIUM. Such a majority could alternatively credibly threaten the minority and force them to modify the way they live their lives. The threat would be especially credible if the majority likes the scenario where a minority is punished for refusing to conform.
In other words: this mechanism seems to be incompatible with your description of personalised Utopias as the best possible place to be (subject only to a few non intrusive ground rules).
The Cosmic Block and a specific set of tests
This relies on a set of definitions. And these definitions would have to hold up against a set of clever AIs trying to break them. None of the rules that you have proposed so far would prevent the strategy used by BPA to punish Steve, outlined in my initial comment. OldSteve is hurt in a way that is not actually prevented by any rule that you have described so far. For example: the ``is torture happening here″ test would not trigger for what is happening to OldSteve. So even if Steve does in principle have the ability to stop this by using some resources destroying mechanism, Steve will not be able to do so. Because Steve will never become aware of what Bob is doing to OldSteve. Steve considers OldSteve to be himself in a relevant sense. So, according to Steve’s worldview, Steve will experience a lot of very unpleasant things. But the only version of Steve that would be able to pay resources to stop this, would not be able to do so.
So the security hole pointed out by me in my original thought experiment is still not patched. And patching this security hole would not be enough. To protect Steve, one would need to find a set of rules that preemptively patches every single security hole that one of these clever AIs could ever find.
I think that it would be better to just not create such a set of AIs
Let’s reason from the assumption that Bob’s Personal Advocate (BPA) is a clever AI that will be creating Bob’s Personalised Utopia. Let’s now again take the perspective of ordinary human individual Steve, that gets no special treatment. I think the main question that determines Steve’s safety in this scenario, is how BPA is adopting Steve-referring-preferences. I think this is far more important for Steve’s safety, than the question of what set of rules will govern Bob’s Personalised Utopia. The question of what BPA wants to do to Steve, seems to me to be far more important for Steve’s safety, than the question of what set of rules will constrain the actions of BPA.
Another way to look at this is to think in terms of avoiding contradictions. And in terms of making coherent proposals. A proposal that effectively says that everyone should be given everything that they want (or effectively says that everyone’s values should be respected) is not a coherent proposal. These things are necessarily defined in some form of outcome or action space. Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.
This can be contrasted with giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her. Since this is defined in preference adoption space, it cannot guarantee that everyone will get everything that they want. But it also means that it does not imply contradictions (see this post for a discussion of these issues in the context of Membrane formalisms). Giving everyone such influence is a coherent proposal.
It also happens to be the case that if one wants to protect Steve from a far superior intellect, then preference adoption space seems to be a lot more relevant than any form of outcome or action space. Because if a superior intellect wants to hurt Steve, then one has to defeat a superior opponent in every single round of a near infinite definitional game (even under the assumption of perfect enforcement, winning every round in such a definitional game against a superior opponent seems hopeless). In other words: I don’t think that the best way to approach this is to ask how one might protect Steve from a large set of clever AIs that wants to hurt Steve for a wide variety of reasons. I think a better question is to ask how one might prevent the situation where such a set of AIs wants to hurt Steve.
Influence over preferences of a single entity is much more conflict-y.
The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.
BPA shouldn’t be allowed to want anything for Steve. There shouldn’t be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can’t even know that Steve exists.
I think the difficult part is when BPA looks at Bob’s preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the “not-torture” bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them.
And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And ‘being forced to participate in a status hierarchy that you are not at the top of’ is a type of ‘generalized consensual harm’.
Even the good old Reedspacer’s Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.
So I don’t think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity’s existence is actually miserable.
If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.
And, the problem with saying “OK let’s just ban the creation of catgirls” is that then maybe Reedspacer builds a volcano lair just for himself and plays video games in it, and the catgirls whose existence you prevented are going to scream bloody murder because you took away from them a very good existence that they would have enjoyed and also made Reedsapcer sad.
But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?
I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob’s Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn’t trigger your torture detection test.
This kind of ‘marginal case of consensual torture’ has popped up in other similar discussions. E.g. In Yvain’s (Scott Alexander’s) article on Archipelago there’s this section:
So Scott Alexander’s solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.
I think the issue here is that “people who unwisely consent to torture even after being told about it” and “people who are willing and consenting submissives” is not actually a hard boundary.
I thought that your Cosmic Block proposal would only block information regarding things going on inside a given Utopia. I did not think that the Cosmic Block would subject every person to forced memory deletion. As far as I can tell, this would mean removing a large portion of all memories (details below). I think that memory deletion on the implied scale would seriously complicate attempts to define an extrapolation dynamic. It also does not seem to me that it would actually patch the security hole illustrated by the thought experiment in my original comment (details below).
The first section argues that (unless Bob’s basic moral framework has been dramatically changed by the memory deletion) no level of memory deletion will prevent BPA from wanting to find and hurt Steve. In brief: BPA will still be subject to the same moral imperative to find and hurt any existing heretics (including Steve).
The second section argues that BPA is likely to find Steve. In brief: BPA is a clever AI and the memory deletion is a human constructed barrier (the Advocates are extrapolations of people that has already been subjected to these memory wipes. So Advocates cannot be involved when negotiating the rules governing these memory wipes). BPA would still have access to a lot of different information sources that it can use to find Steve.
The third section argues that if BPA finds Steve, then BPA would be able to hurt Steve. In brief: creating OldSteve is still not prevented by any rule or constraint that you have mentioned so far.
The fourth section argues that the side effects of memory deletion would be severe. In brief: memories of every conversation about any deleted person would also be deleted. Besides all direct memories involving any deleted person, many indirect memories would also be deleted. This would seriously complicate extrapolation. (Extrapolation is already a very tricky definitional problem. And this definitional problem cannot be delegated to extrapolated Advocates, since they are the result of the extrapolation dynamic being defined).
The last section deals with your proposed resource destruction mechanism. In brief: in one thought experiment (that did not involve threats) almost half the population would very strongly prefer to destroy ELYSIUM. This disproved your claim that such a situation would not arise. Also: no thought experiment involved anyone trying to gain resources. And no thought experiment involved anyone issuing any form of threat that would not be followed up on. People burned resources that they did not need. They did this to either hurt people directly. Or to issue genuinely credible threats.
BPA will still want to find and hurt heretics
It would not be surprising if a comprehensive, human defined, memory deletion operation would completely re write someone’s basic moral framework. But I’m guessing that your proposed memory deletion is not supposed to be done in a way that changes a persons basic moral framework. So let’s reason from the assumption that it does not.
This means that Bob still considers BPA to be bound by a non negotiable moral imperative. So BPA still wants to find and punish any heretic that might exist.
In other words: the Cosmic Block is not the type of mechanism that might prevent BPA from wanting to find and hurt Steve. It is instead a practical obstacle that BPA needs to overcome (which is something very different). It is a human constructed practical barrier, that is supposed to protect Steve from a clever AI that wants to find and hurt Steve.
BPA will likely be able to find Steve
Unless the details of your proposed Cosmic Block are constructed by an AI that prevents All Bad Things, these rules must come from somewhere else. AI assisted negotiations cannot be done by the Advocates. Advocates are the result of extrapolating memory wiped people (otherwise the whole point of the Cosmic Bloc is lost). So the Advocates cannot be involved in defining the memory wipe rules.
In other words: unless the memory wipe rules are negotiated by a completely separate set of (previously unmentioned) AIs, the memory wipe rules will be human defined.
This means that a human constructed barrier must hold against a clever AI trying to get around it. Even if we were to know that a human defined barrier has no humanly-findable security holes, this does not mean that it will actually hold against a clever AI. A clever AI can find security holes that are not humanly-findable.
The specific situation that BPA will find itself in does not seem to be described in sufficient detail for it to be possible to outline a specific path along which BPA finds Steve. But from the currently specified rules, we do know that BPA has access to several ways of gathering information about Steve.
People can pool resources (as described in your original proposal). So Advocates can presumably ask other Advocates about potential partners for cohabitation. Consider the case where BPA is negotiating with other Advocates regarding who will be included in a potential shared environment. This decision will presumably involve information about potential candidates. Whether or not a given person is accepted, would presumably depend on detailed personal information.
Advocates can also engage in mutual resource destruction to prevent computations happening within other Utopias. You describe this mechanism as involving negotiations between Advocates, regarding computations happening within other people’s Utopias. Such negotiations would primarily be between the Advocates of people that have very different values. This is another potential information source about Steve.
Steve would also have left a lot of effects on the world, besides effects on peoples memories. Steve might for example have had a direct impact on what type of person someone else has turned into. Deleting this impact would be even more dramatic than deleting memories.
Steve might have also have had a significant impact on various group dynamics (for example: his family, the friend groups that he has been a part of, different sets of coworkers and classmates, online communities, etc). Unless all memories regarding the general group dynamics of every group that Steve has been a part of is deleted, Steve’s life would have left behind many visible effects.
The situation is thus that a clever AI is trying to find and hurt Steve. There are many different types of information sources that can be combined in clever ways to find Steve. The rules of all barriers between this AI and Steve are human constructed. Even with perfect enforcement of all barriers, this still sounds like a scenario where BPA will find Steve (for the same reason that a clever AI is likely to find its way out of a human constructed box, or around a human constructed Membrane).
There is still nothing protecting Steve from BPA
If BPA locates Steve, then there is nothing preventing BPA from using OldSteve to hurt Steve. What is happening to OldSteve is still not prevented by any currently specified rule. The suffering of OldSteve is entirely caused by internal dynamics. OldSteve never lacks any form of information. And the harm inflicted on OldSteve is not in any sense marginal.
I do not see any strong connection between the OldSteve thought experiment and your Scott Alexander quote (which is concerned with the question of what options and information a government run by humans should provide to children raised by other humans). More generally: scenarios that include a clever AI that is specifically trying to hurt someone have a lot of unique properties (important properties that are not present in scenarios that lack such an AI). I think that these scenarios are dangerous. And I think that they should be avoided (as opposed to first created and then mitigated). (Avoiding such scenarios is a necessary, but definitely not sufficient, feature of an alignment target.)
Memory wipes would complicate extrapolation
All deleted memories must be so thoroughly wiped that a clever AI will be unable to reconstruct them (otherwise the whole point of the Cosmic Block is negated). Deleting all memories of a single important negative interpersonal relationship would be a huge modification. Even just deleting all memories of one famous person who served as a role model would be significant.
Thoroughly deleting your memory of a person would also impact your memory of every conversation that you have ever had about this person, including conversations with people who are not deleted. Most long-term social relationships involve a lot of discussion of other people (one person describing past experiences to the other, discussions of people that both know personally, arguments over politicians or celebrities, etc). Thus, the memory deletion would significantly alter the memories of essentially all significant social relationships. This is not a minor thing to do to a person. (That every person would be subjected to this is not obviously implied by the text in The ELYSIUM Proposal.)
In other words: even memories of non-deleted people would be severely modified. For example: every discussion or argument about a deleted person would be deleted. Two people (who do not delete each other) might suddenly have no idea why they almost cut all contact a few years ago, or why their interactions have been so different for the last few years. Either their Advocates can reconstruct the relevant information (in which case the deletion does not serve its purpose), or their Advocates must try to extrapolate them while lacking a lot of information.
Getting the definitions involved in extrapolation right seems like it will be very difficult even under ordinary circumstances. Wide-ranging and very thorough memory deletion would presumably make extrapolation even trickier. This is a major issue.
Your proposed resource destruction mechanism
No one in any of my thought experiments was trying to get more resources. The 55 percent majority (and the group of 10 people) have a lot of resources that they do not care much about. They want to create some form of existence for themselves, which only takes a fraction of their available resources to set up. They can then burn the rest of their resources on actions within the resource destruction mechanism. They either burn these resources to directly hurt people, or they risk these resources by making threats that are completely credible. In the thought experiments where someone does issue a threat, the threat is issued because: a person giving in > burning resources to hurt someone who refuses > leaving someone who refuses alone. They are perfectly OK with an outcome where resources are spent on hurting someone who refuses to comply (they are not self-modifying as a negotiation strategy; they simply consider this a perfectly acceptable outcome).
Preventing this type of threat would be difficult because (i): negotiations are allowed, and (ii): in any scenario where threats are prevented, the threatened action would simply be taken anyway (for non-strategic reasons). There is no difference in behaviour between scenarios where threats are prevented and scenarios where threats are ignored.
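To make the structure of this explicit, here is a minimal sketch (my own illustration, not part of either proposal; the payoff numbers are arbitrary and only encode the preference ordering stated above):

```python
# Sketch of the credible-threat structure described above.
# The numbers are arbitrary; only the ordering matters:
#   target gives in > burn resources to hurt a refuser > leave a refuser alone.
PAYOFFS = {
    "target_gives_in": 3,
    "burn_resources_to_hurt_refuser": 2,
    "leave_refuser_alone": 1,
}

def threatener_action(target_refuses: bool, threats_prevented: bool) -> str:
    """The threatener's choice, given the target's behaviour and the rules."""
    if not target_refuses:
        return "target_gives_in"
    # The threatener genuinely prefers hurting a refuser over leaving them alone,
    # so whether the threat could be voiced (threats_prevented) changes nothing.
    return max(["burn_resources_to_hurt_refuser", "leave_refuser_alone"],
               key=PAYOFFS.get)

# Identical behaviour whether threats are prevented or merely ignored:
assert (threatener_action(True, threats_prevented=True)
        == threatener_action(True, threats_prevented=False))
```

The point of the sketch is that the harmful action is preferred for its own sake, so a rule that forbids announcing the threat removes the announcement but not the behaviour.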
The thought experiment where a majority burns resources to hurt a minority was a simple example scenario where almost half of the population would very strongly prefer to destroy ELYSIUM (or strongly prefer that ELYSIUM was never created). It was a response to your claim that your resource destruction mechanisms would prevent such a scenario. This thought experiment did not involve any form of threat or negotiation.
Let’s call a rule that prevents the majority from hurting the minority a Minority Protection Rule (MPR). There are at least two problems with your claim that a pre-AI majority would prevent the creation of a version of ELYSIUM that has an MPR.
First: without an added MPR, the post-AI majority is able to hurt the minority without giving up anything that they care about (they burn resources they don't need). So there is no reason to think that an extrapolated post-AI majority would want to prevent the creation of a version of ELYSIUM with an MPR. They would prefer the version without an MPR, but that does not imply that they care enough to try to prevent the creation of a version with an MPR: doing so would presumably be very risky, and they would not gain anything that they care much about. When hurting the minority does not cost them anything that they care about, they do it. That does not imply that this is an important issue for the majority.
More importantly however: you are conflating (i): a set of un-extrapolated and un-coordinated people living in a pre-AI world, with (ii): a set of clever AI Advocates representing these same people, operating in a post-AI world. There is nothing unexpected about un-extrapolated humans opposing an AI that would be good for them, or supporting an AI that would be bad for them (as judged from the perspective of their extrapolated Advocates). That is the whole point of having extrapolated Advocates.
If there is an agent that controls 55% of the resources in the universe and is prepared to use 90% of that 55% to kill/destroy everyone else, then, assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.
And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who is actually trying to create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen.
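As a back-of-the-envelope reading of those figures (my own arithmetic, added for illustration; the 55% and 90% are the numbers from the thought experiment above):

```python
# Resources the majority commits to stopping ELYSIUM's creators,
# compared with everything the rest of the population controls combined.
majority_share = 0.55
fraction_committed = 0.90

committed = majority_share * fraction_committed   # roughly 0.495 of all resources
everyone_else = 1.0 - majority_share              # 0.45 of all resources

print(f"{committed:.3f} vs {everyone_else:.3f}")  # 0.495 vs 0.450
print(committed > everyone_else)                  # True
```

Under these assumptions the attackers can outspend everyone else in the universe combined, which is the sense in which they "would likely succeed".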
Re: threats, see my other comment.
This assumes that threats are allowed. If you allow threats within your system, you lose most of the value of trying to create an artificial utopia, because you will recreate most of the bad dynamics of real history, which ultimately revolve around threats of force as a means of acquiring resources. So the ability to prevent entities from issuing threats that they then do not follow through on is crucial.
Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.
In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal in most countries, and it is punished by a loss of resources far greater than anything the threat could win.
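One way to make that deterrence logic concrete (a minimal sketch under my own assumptions; the expected-value framing and the numbers are illustrative, not taken from the comment above):

```python
# A threat is only worth issuing if its expected gain exceeds the expected penalty.
# Legal systems that punish the mere threat aim to make this inequality fail.
def threat_is_worth_issuing(gain_if_target_complies: float,
                            prob_target_complies: float,
                            penalty_if_caught: float,
                            prob_caught: float) -> bool:
    expected_gain = gain_if_target_complies * prob_target_complies
    expected_penalty = penalty_if_caught * prob_caught
    return expected_gain > expected_penalty

# A penalty far greater than the possible gain makes the threat a losing move
# even when the chance of being caught is modest:
print(threat_is_worth_issuing(gain_if_target_complies=100, prob_target_complies=0.9,
                              penalty_if_caught=10_000, prob_caught=0.05))  # False
```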
People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).
The ability to have civil disobedience without protection rackets does seem kind of crucial.