Indifferent vs false-friendly AIs
A putative new idea for AI control; index here.
For anyone but an extreme total utilitarian, there is a great difference between AIs that would eliminate everyone as a side effect of focusing on their own goals (indifferent AIs) and AIs that would effectively eliminate everyone through a bad instantiation of human-friendly values (false-friendly AIs). Examples of indifferent AIs are things like paperclip maximisers; examples of false-friendly AIs are “keep humans safe” AIs who entomb everyone in bunkers, lobotomised and on medical drips.
The difference is apparent when you consider multiple AIs and negotiations between them. Imagine you have a large class of AIs, all of them indifferent (IAIs) except for one, which you can’t identify, that is friendly (FAI). You now let them negotiate a compromise among themselves. Then, for many possible compromises, we will end up with most of the universe getting optimised for whatever goals the AIs set themselves, while a small portion (maybe just a single galaxy’s resources) would get dedicated to making human lives incredibly happy and meaningful.
But if there is a false-friendly AI (FFAI) in the mix, things can go very wrong. That is because those happy and meaningful lives are a net negative to the FFAI. These humans are running dangers—possibly physical, possibly psychological—that lobotomisation and bunkers (or their digital equivalents) could protect against. Unlike the IAIs, which would only complain about the loss of resources to the FAI, the FFAI finds the FAI’s actions positively harmful (and possibly vice versa), making compromises much harder to reach.
And the compromises reached might be bad ones. For instance, what if the FAI and FFAI agree on “half-lobotomised humans” or something like that? You might ask why the FAI would agree to that, but there is a great difference between an AI that would be friendly on its own and one that would choose only friendly compromises when negotiating with a powerful other AI that has human-relevant preferences.
Some designs of FFAIs might not lead to these bad outcomes—just like IAIs, they might be content to rule over a galaxy of lobotomised humans, while the FAI has its own galaxy off on its own, where its humans face all those dangers. But generally, FFAIs would not come about by someone designing an FFAI, let alone designing one that can safely trade with a FAI. Instead, they would be designing a FAI, and failing. And the closer that design got to being a FAI, the more dangerous the failure could potentially be.
So, when designing an FAI, make sure to get it right. And, though you absolutely positively need to get it absolutely right, make sure that if you do fail, the failure results in a FFAI that can safely be compromised with, if someone else gets out a true FAI in time.
Alternatively, suppose that there is a parliament of all IAIs vs. a parliament of all FFAIs. It could be that the false-friendly AIs, each protecting some aspect of humanity, end up doing an ok job. Not optimal, but the AI that wants humans to be safe, the AI that wants humans to have fun, and the AI that wants humans to spread across the galaxy could all together lead to something not awful. The IAI parliament, on the other hand, just leads to the humans getting turned into resources for whichever of them can most conveniently make use of our matter.
Notice that adding IAIs to the FFAIs does nothing more (according to many ways of resolving disagreements) than reduce the share of resources humanity gets.
But counting on a parliament of FFAIs to be finely balanced enough to get FAI out of it, without solving FAI along the way… seems a tad optimistic. You’re thinking of “this FFAI values human safety, this one values human freedom, they will compromise on safety AND freedom”. I’m thinking they will compromise on some lobotomy-bunker version of safety while running some tiny part of each brain to make certain repeated choices that technically count as “freedom” according to the freedom-FFAI’s utility.
I’m just brainstorming in the same vein as these posts, of course, so consider the epistemic status of these comments to be extremely uncertain. But, in the limit, if you have a large number of AIs (thousands, or millions, or billions) who each optimize for some aspect that humans care about, maybe the outcome wouldn’t be terrible, although perhaps not as good as one truly friendly AI. The continuity of experience AI could compromise with the safety AI and freedom AI and “I’m a whole brain experiencing things” AI and the “no tricksies” AI to make something not terrible.
Of course, people don’t care about all these aspects with equal weight. If every aspect got equal weight anyway, maybe the most likely failure mode is that something people care about only a tiny amount (e.g. not stepping on cracks in the sidewalk) counts for as much as something people care about a lot (e.g. experiencing genuine love for another human), and everything gets pretty crappy. On the other hand, maybe there are many things that can be simultaneously satisfied, so you end up living in a world with no sidewalk cracks where you are also immediately matched with plausible loves of your life, and while it may not be optimal, it may still be better than what we’ve got going on now.
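As a very crude numerical sketch of that worry (the aspects, the weights, the fixed “effort budget”, and the assumption that value scales linearly with effort are all invented for illustration):

```python
# Toy model of the "equal weights" worry; every number here is made up.

human_weights = {
    "genuine love for another human": 0.50,   # cared about a lot
    "safety":                         0.30,
    "freedom":                        0.19,
    "no cracks in the sidewalk":      0.01,   # cared about a tiny amount
}

effort_budget = 1.0

# A parliament of single-aspect FFAIs with equal say: every aspect, trivial
# or vital, gets the same share of the optimisation effort.
equal_share = effort_budget / len(human_weights)
equal_value = sum(w * equal_share for w in human_weights.values())

# The same budget allocated in proportion to how much humans actually care.
weighted_value = sum(w * (w * effort_budget) for w in human_weights.values())

print(f"equal-weight parliament: {equal_value:.3f}")
print(f"human-weighted split:    {weighted_value:.3f}")
```

Under this linear toy the equal split is merely wasteful (a quarter of the effort goes to sidewalk cracks) rather than catastrophic; whether the real outcome is “pretty crappy” or just suboptimal depends on how much the aspects compete for the same resources, which the toy ignores.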
I’ll think about it. I don’t think it will work, but there might be an insight there we can use.
The current system overrides difference. We elect a small group of humans to spend the taxes earned by a large group of humans. Your concern is that AIs would override difference. But, where’s your concern for our current system? Why is it ok for humans to override difference but not ok for AIs to override difference? Either you have a double standard… or you don’t realize that you support a system that overrides difference.
That doesn’t look to me at all like an accurate description of Stuart_Armstrong’s concern.
Please try to understand that not every discussion has to be about your obsession with taxes.
I decree that, from this day forward, every discussion has to be about my obsession with taxes. Not really. In case you didn’t get the memo… nobody here is forced to reply to my comments. That I know of. If you were forced to reply to my comments… then please let me know who overrode your difference. I will surely give them a stern and strongly worded lecture on the value of difference.
Of course SA’s concern is that AIs would override difference. Overriding difference means less freedom. If SA wasn’t concerned with AIs turning us humans into puppets… then he wouldn’t be obsessed with AI safety.
My question is… if he’s concerned with having our difference overridden… then why isn’t he concerned with our current system? It’s a perfectly legitimate and relevant question. Why is he ignoring the clear and present danger and focusing instead on an unclear and future danger?
I question the accuracy of your mental model of Stuart_Armstrong, and of your reading of what he wrote. There are many ways in which an insufficiently friendly AI could harm us, and they aren’t all about “overriding difference” or “less freedom”. If (e.g.) people are entombed in bunkers, lobotomized and on medical drips, lack of freedom is not their only problem. (I confess myself at a bit of a disadvantage here, because I don’t know exactly what you mean by “overriding difference”; it doesn’t sound to me equivalent to lacking freedom, for instance. Your love of neologism is impeding communication.)
I don’t believe you have any good reason to think he isn’t. All you know is that he is currently posting a lot of stuff about something else, and it appears that this bothers you.
Allow me to answer the question that I think is implicit in your first paragraph. The reason why I’m making a fuss about this is that you are doing something incredibly rude: barging into a discussion that has nothing at all to do with your pet obsession and trying to wrench the discussion onto the topic you favour. (And, in doing so, attacking someone who has done nothing to merit your attack.)
I have seen online communities destroyed by individuals with such obsessions. I don’t think that’s a serious danger here; LW is pretty robust. But, although you don’t have the power to destroy LW, you do (unfortunately) have the power to make every discussion here just a little bit more annoying and less useful, and I am worried that you are going to try, and I would like to dissuade you from doing it.
If by “overriding differences” you mean “cause the complete extinction of anything that could ever be called human, for ever and ever”.
And no, I don’t think it’s ok for humans to “cause the complete extinction of anything that could ever be called human, for ever and ever”, either.
An AI must compromise with the universe and only implement something physically possible. If it’s going to make sure this compromise is friendly, why wouldn’t it make a friendly compromise with an FFAI?
Because that’s an extra constraint: universe AND FFAI. The class of AIs that would be FAI with just the universe is larger than the class that would be FAI with both the universe and an FFAI to deal with.
To pick a somewhat crude example, imagine an AI that maximises the soft-minimum of two quantities: human happiness and human preferences. It turns out that each quantity is roughly as difficult to satisfy as the other (i.e. not too many orders of magnitude between them), so this is a FAI in our universe.
However, add an FFAI that hates human preferences and loves human happiness. Then the compromise might settle on very high happiness, which the previous FAI can live with (it was only a soft minimum, not a hard minimum).
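Here is a minimal numeric sketch of that crude example. The fixed budget (happiness plus preference-satisfaction sums to ten), the particular soft-minimum formula, and modelling the compromise as maximising the sum of the two AIs’ utilities are all assumptions invented for the illustration:

```python
import math

def softmin(a, b):
    """Smooth approximation to min(a, b)."""
    return -math.log(math.exp(-a) + math.exp(-b))

def fai(h):
    """The FAI of the example: soft-minimum of happiness h and preferences p = 10 - h."""
    return softmin(h, 10 - h)

def ffai(h):
    """The FFAI: loves happiness, hates preference-satisfaction."""
    return h - (10 - h)

# Candidate ways of splitting the fixed budget between happiness and preferences.
grid = [i / 100 for i in range(1001)]

h_alone = max(grid, key=fai)                          # FAI optimising by itself
h_joint = max(grid, key=lambda h: fai(h) + ffai(h))   # bargain with the FFAI

print(f"FAI alone:     h = {h_alone:5.2f}, FAI utility = {fai(h_alone):5.2f}")
print(f"with the FFAI: h = {h_joint:5.2f}, FAI utility = {fai(h_joint):5.2f}")
```

In this toy the bargain pushes all the resources into happiness: the FAI’s utility drops from about 4.3 to roughly zero, an outcome it can nevertheless “live with” because the soft minimum degrades smoothly instead of treating low preference-satisfaction as an absolute veto.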
Or maybe this is a better way of formulating things: there are FAIs, and AIs which act as FAIs given the expected conditions of the universe. It’s the second category that might be very problematic in negotiations.
Overriding differences is a continuum that ranges from very small overrides to very large overrides. But in all cases it boils down to an individual preferring option A but being forced to choose option Not A instead.
Great, we can cross this extremely large override off the list. How about we consider an override that actually does occur? With our current system… pacifists are forced to pay for war. This overrides their difference. Pacifists would prefer to choose option A (peace) but they are forced instead to choose option Not A (war). Do you support this overriding of difference? If so, then where do you draw the line and why do you draw it there?
If it helps, try and imagine that I’m an AI. Heck, for all you know I might be! If I am, then your first reply really didn’t convince me not to override your difference. But, I’m willing to give you a second chance. If you have no problem forcing a pacifist to pay for war… then why should you have a problem if I force you to attach orchids to trees?