I’m quite new to the AI alignment problem (I’ve read something like 20-30 articles about it, on LW and elsewhere, aiming for the most upvoted) and I have a feeling that there is a fundamental problem here that is mostly ignored. I wouldn’t be surprised if this feeling were wrong, because the problem is either not fundamental or not ignored.
Imagine a future world, where a singularity occurred, and we have a nearly-omnipotent AI. AI understands human values, tries to do what we want, makes no mistakes—so we could say that humanity did pretty well on the AI alignment task. Or maybe not. How do we find out?
Let’s consider a few scenarios:
AI creates a heaven on Earth for everyone. Fertility rates keep falling—e.g. because people in heaven are not interested in procreation, or maybe we globally accept VHEM as the best ethical position, or for some other reason. Unfortunately, immortality turns out to be impossible. Humans go extinct.
AI creates a virtual heaven for everyone. We live forever, lives as full of fun as possible, never knowing that none of it is real. Note: there are no other sentient beings in those virtual heavens, so this has nothing to do with the simulation hypothesis.
It turns out that smarter, more knowledgeable humans, who are “more the people they wished they were”, just want to be able to get more happiness from simple everyday activities. AI strictly follows the Coherent Extrapolated Volition principles and skilfully teaches us Buddhism-or-whatever. Earth becomes the prettiest garden in the Universe, tended by a few billion extremely happy gardeners.
Now, what mark would you give humanity on the “AI alignment” task in each of the scenarios? Is there any agreement among AI alignment researchers about this? I would be surprised neither by AAA nor FFF—despite the fact that I didn’t even touch the really hard problems, like spawning zillions of ems.
I have a feeling that such issues don’t get the attention they really deserve—we all happily agree that being turned into paperclips, tiled over with smiley-faces, or subjected to mindcrime would be bad, and hope the answer to the rest of the problems is “utilitarianism”.
So, the question(s).
Is there any widely-accepted way of solving such problems (i.e. rating humanity in the scenarios above)?
Is this considered an important problem? Let’s say, the one that has to be solved before singularity? If not—why?
(Speculative bonus question, on which I hope to elaborate one day) are we sure the set of things we call “human values” is not contradictory when executed by an omnipotent being?
Answers: (1) Not one I would trust with my future, no. (2) I would be very surprised if humans managed to build a FAI without being able in principle to reliably judge the relative value of different scenarios. (3) I would be extremely surprised if the things we currently call human values were not contradictory, even if only because (a) they’re all underspecified, hence the first two questions, and (b) different humans have different values that really do conflict.
In your three scenarios, I’d say all of them are likely far better than we have any right to expect and would count as a positive singularity in my estimation, although there are enough unspecified details that could sway me otherwise.
For me the most glaring fault in them, especially (2) and (3), is that they prescribe a single kind of future that all humans somehow agree on or are persuaded/coerced to agree to. In this sense they make me feel a bit like I felt when I read Friendship is Optimal—not anything I’d deliberately set out to build as I am now, but something I’m capable of valuing and appreciating.
Also, for (1) the phrase “immortality turns out to be impossible” means very different things in sub-scenario (a), where life extension of humans beyond 120 years is impossible, vs (b), where lifespans can be extended many times, possibly to millions of subjective years or more, but we can’t escape the heat death of the universe. VHEM seems much more understandable to me in the latter world than the former, if making new people would shorten the lifespan of every existing person by spreading resources thinner.
This seems important to me too. I have some hope that it’s at least possibly deferrable until post-singularity, e.g. have the AI let everyone know it exists and will provide for everyone’s basic needs for a year while they think about what they want the future to look like. Stuart Armstrong’s fiction Just another day in utopia and The Adventure: a new Utopia story are examples of exploring possible answers to what we want.
-
E.g. Eliezer seems to think it’s not the perfect future: “The presence or absence of an external puppet master can affect my valuation of an otherwise fixed outcome. Even if people wouldn’t know they were being manipulated, it would matter to my judgment of how well humanity had done with its future. This is an important ethical issue, if you’re dealing with agents powerful enough to helpfully tweak people’s futures without their knowledge”.
Also, you write:
If we really want this, we have to refrain from spending our whole lives playing the best RPG possible.
Consider the human rules “you are allowed to lie to someone for the sake of their own utility” and “everyone should be able to take control of their own life”. We know that lies about serious things never turn out well, so we lie only about things of little importance, and little lies like “yes grandma, that was very tasty” don’t contradict the second rule. This looks different when you are an ultimate deceiver.
-
A simple way of rating the scenarios above is to describe them as you have and ask humans what they think.
In a way… but I expect that what we actually need to solve is just how to make a narrow AI faithfully generate AI papers and AI safety papers that humans would have come up with given time.
The CEV paper has gone into this, but indeed human utility functions will have to be aggregated in some manner, and the manner in which to do this and allocate resources can’t be derived from first principles. Fortunately, human utility functions are logarithmic enough, and enough people care about enough other people, that the basin of acceptable solutions is quite large, especially if we get the possible future AIs to acausally trade with each other.
I do not see why that should be the case. Assuming virtual heavens, why couldn’t each individual’s personal preferences be fulfilled?
The universe is finite, and its resources have to be distributed in some manner.
Some people prefer interactions with the people alive today to interactions with heavenly replicas of them. You might claim that there is no difference—that in the end it’s all atoms, and all the meaning is made up anyway—but we know exactly why those people would not approve if we described virtual heavens to them, so we shouldn’t just impose the heavens on them anyway.
Some people care about what other people do in their virtual heavens. You could deontologically tell them to fuck off, but I’d expect the model of dictator lottery + acausal trade to arrive at another solution.
Do you think this is worth doing?
I thought that either this was done a billion times and I just missed it, or this is neither important nor interesting to anyone but me.
I see this not as a question to ask now, but later, on many levels of detail, when the omnipotent singleton is deciding what to do with the world. Of course we will have to figure out the correct way to pose such questions before deployment, but that can be deferred until we can generate research.