the idea:
we give the AI a massive list of actions
each one is annotated with how much utility we estimate it to have
for example, we list “giving someone sad a hug” as having, say, 6 utility, but “giving someone sad a hug | they didn’t consent to it” has −4 utility or something like that
we train it to learn human values
we give it new actions, and see if it can guess the utility we assigned to those
eventually it gets really accurate at that
eventually we generate completely random sequences of actions, and have it guess the utility of all of them
so it writes a near-infinite sized utility function containing millions of billions of different actions and the utility of each one
we make a second AI, dumber than the first one but still really smart, and plug that utility function (the one the first AI wrote) into it (a rough code sketch of the whole pipeline is just below this list)
we turn it on
awesome singularity stuff happens yay we did it
if we’re still scared of it doing something weird, we can additionally tell the second AI to minimize actions that don’t register at all in the first AI’s model of human values, to stop it from doing something really bad that current humanity can’t comprehend and that the first AI therefore never got humanity’s opinion on
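A minimal sketch of what this pipeline could look like, under heavy simplifying assumptions: actions are plain feature vectors, the human annotations are simulated, and the “Values AI” is a small linear regression rather than anything smart. Every name and number here is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Humans annotate a massive list of actions with utility estimates.
# Here an "action" is just a feature vector; in the actual proposal it would be
# some richer representation (natural language, simulation traces, ...).
n_actions, n_features = 10_000, 32
actions = rng.normal(size=(n_actions, n_features))
true_weights = rng.normal(size=n_features)  # stand-in for actual human values
human_estimates = actions @ true_weights + rng.normal(scale=0.5, size=n_actions)

# The "Values AI" learns to predict the human utility estimates.
# A ridge-regression closed form keeps the sketch short; the real thing would
# be a large learned model, not a linear one.
lam = 1e-2
learned_weights = np.linalg.solve(
    actions.T @ actions + lam * np.eye(n_features),
    actions.T @ human_estimates,
)

def learned_utility(candidate_actions: np.ndarray) -> np.ndarray:
    """The 'near-infinite sized utility function': score any proposed action."""
    return candidate_actions @ learned_weights

# The second (dumber) AI just picks whichever available action scores highest.
candidates = rng.normal(size=(1_000, n_features))
chosen = candidates[np.argmax(learned_utility(candidates))]
print("utility the agent thinks it gets:", learned_utility(chosen[None])[0])
print("utility it actually gets:        ", chosen @ true_weights)
```

The objections below are mostly about why the step from the human annotations to `learned_utility`, and the step from `learned_utility` to whatever the second AI actually does, don’t survive contact with reality.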
The relatively easy problems:
The humans’ utility estimates will be wrong. And not “random noise” kind of wrong, but systematically and predictably wrong.
Applying lots of optimization pressure to the humans’ estimates will predictably Goodhart on the wrongness of those estimates (toy illustration after this list).
… also actions alone are not “good” or “bad”, tons and tons of context is relevant.
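A toy illustration of the two points above, with a completely made-up bias model: the annotators’ errors all point in the same direction (they over-rate actions that merely look impressive), and the harder we select on their estimates, the wider the gap between estimated and actual utility gets.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
true_utility = rng.normal(size=n)
# Systematic, not random-noise, error: the annotators consistently over-rate
# actions that merely *look* impressive, independent of their real value.
looks_impressive = rng.normal(size=n)
human_estimate = true_utility + 2.0 * looks_impressive

def report(label, picked):
    est = human_estimate[picked].mean()
    act = true_utility[picked].mean()
    print(f"{label}: estimated {est:+.2f}, actual {act:+.2f}")

# Mild optimization pressure: anything in the top 10% of estimates.
report("top 10% of estimates", human_estimate >= np.quantile(human_estimate, 0.9))
# Heavy optimization pressure: the single best-looking action.
report("single best estimate", [np.argmax(human_estimate)])
```

Under mild selection the gap is modest; under argmax selection the chosen action is mostly chosen for its error.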
The hard problem:
What exactly is the “list of actions”?
Natural language description of actions? Then what is going to make the humans’ interpretation of those natural-language symbols accurately represent the things the AI actually does?
Examples of actions taken by an AI in a simulation? What is going to make anything learned from those examples generalize well to the physical world during deployment?
The set of all possible sequences of actions is really really really big. Even if you have an AI that is really good at assigning the correct utilities[1] to any sequence of actions we test it with, its “near infinite sized”[2] learned model of our preferences is bound to come apart at the tails, or in some weird region we forgot to check up on.
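A toy version of “come apart at the tails”: the model below is fit to noisy preference data from the only region we ever test it on, matches it almost perfectly there, and is wildly wrong in a region nobody thought to check. The 1-D “preference function” and the polynomial model are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_preference(x):
    # Stand-in for actual human utility over a 1-D slice of action-space.
    return np.sin(x)

# The Values AI is only ever tested on actions from a limited region...
x_train = rng.uniform(-3, 3, size=200)
y_train = true_preference(x_train) + rng.normal(scale=0.05, size=x_train.size)

# ...which it fits very well with a flexible model (a degree-9 polynomial here).
model = np.poly1d(np.polyfit(x_train, y_train, deg=9))

for region, xs in [("the tested region", np.linspace(-3, 3, 50)),
                   ("a region nobody checked", np.linspace(8, 12, 50))]:
    worst = np.max(np.abs(model(xs) - true_preference(xs)))
    print(f"worst error in {region}: {worst:.2f}")
```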
Good luck getting the ethicists to come to a consensus on this.
Von Neumann: “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
We do not know how to create an AI that would not regularly hallucinate. The Values AI hallucinating would be a bad thing.
In fact, training an AI to follow human values more closely seems to just make it say what humans want to hear, while being objectively incorrect more often.
We do not know how to create an AI that reliably follows the programmed values outside of its training set. Your 2nd AI going off the rails outside of the training set would be bad.
Also, human values, at least the ones we know how to consciously formulate, are pretty fragile: they are things we want weak/soft optimization for, but that would go very badly if a superhuman AI hard-optimized them. We do not know how to capture human values in a way that would not go terribly wrong when the optimization is cranked to the max, and your Values AI is unlikely to help enough, because we would not know what missing inputs we are failing to provide it (they are aspects of our values that would only become important in some future circumstances we cannot even imagine today). A toy illustration of the soft-versus-hard-optimization point appears after the final point below.
Finally, we wouldn’t get a second try: any bugs in your AIs, particularly the 2nd one, are very likely to be fatal. We do not know how to create your 2nd AI in such a way that the very first time we turn it on, all the bugs have already been found and fixed.
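To make the soft-versus-hard optimization point concrete, here is a toy sketch. The “proxy value” below stands for the part of our values we managed to write down, and the penalty term stands for the part that only matters in extreme circumstances; the threshold, the penalty, and the “pick something merely good from a familiar pool” rule (loosely in the spirit of quantilization) are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def proxy_value(x):
    # The part of our values we knew how to write down: "more x is better".
    return x

def actual_value(x):
    # The part we never formulated, because it only matters in circumstances
    # far outside anything we have experienced: past some point, more x is ruinous.
    return x - np.where(x > 100, (x - 100) ** 2, 0.0)

candidates = rng.uniform(0, 1000, size=100_000)

# Soft optimization: pick at random from the merely-good top 10% (by proxy)
# of a familiar, moderate pool of options.
familiar = candidates[candidates < 50]
cutoff = np.quantile(proxy_value(familiar), 0.9)
soft_choice = rng.choice(familiar[proxy_value(familiar) >= cutoff])

# Hard optimization: take the global argmax of the proxy.
hard_choice = candidates[np.argmax(proxy_value(candidates))]

for label, choice in [("soft optimization", soft_choice),
                      ("hard optimization", hard_choice)]:
    print(f"{label}: proxy {float(proxy_value(choice)):9.1f}, "
          f"actual {float(actual_value(choice)):11.1f}")
```

Weak optimization stays in the territory where the proxy and the real values agree; maximizing the proxy outright lands precisely where they come apart.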