[Question] why won’t this alignment plan work?
the idea:
we give the AI a massive list of actions
each one is annotated with how much utility we estimate it to have
for example, we list “giving someone sad a hug” as having, say, 6 utility, but “giving someone sad a hug | they didn’t consent to it” as having −4 utility, or something like that
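to make that concrete, here's a toy sketch (in python) of what that labeled list could look like. the action strings and utility numbers are just illustrations I made up, not anything from a real dataset:

```python
# illustrative only: a tiny stand-in for the "massive list of actions",
# each annotated with a rough human estimate of its utility
labeled_actions = [
    ("giving someone sad a hug", 6.0),
    ("giving someone sad a hug | they didn't consent to it", -4.0),
    ("returning a lost wallet to its owner", 5.0),
    ("lying to a friend for personal gain", -6.0),
]
```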
we train it to learn human values
we give it new actions, and see if it can guess the utility we’d assign to those
eventually it gets really accurate at that
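here's a hedged sketch of that train-then-test loop, using scikit-learn's TfidfVectorizer and Ridge as a tiny stand-in for the actual AI (the real plan assumes something far more capable, and the held-out action string is made up):

```python
# toy stand-in for "train it to learn human values": fit a regressor on
# (action, utility) pairs, then see how well it guesses utilities for
# actions it never saw during training
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

labeled_actions = [  # same toy list as above
    ("giving someone sad a hug", 6.0),
    ("giving someone sad a hug | they didn't consent to it", -4.0),
    ("returning a lost wallet to its owner", 5.0),
    ("lying to a friend for personal gain", -6.0),
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform([a for a, _ in labeled_actions])
utility_model = Ridge().fit(X_train, [u for _, u in labeled_actions])

# "we give it new actions, and see if it can guess the utility we'd assign"
new_action = ["comforting a stranger who is crying"]
guess = float(utility_model.predict(vectorizer.transform(new_action))[0])
print(new_action[0], "->", guess)
```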
eventually we generate completely random sequences of actions, and have it guess the utility of all of them
so it writes out a near-infinite utility function: millions of billions of different actions, each paired with the utility it estimates for them
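and a toy sketch of that last step: generating random sequences of actions and dumping the model's guesses into one giant action-to-utility table. here `predict_utility` is just a placeholder for the trained model above, and the primitive actions are invented:

```python
# toy sketch of "generate completely random sequences of actions and have it
# guess the utility of all of them", written out as one big lookup table
import random

primitive_actions = ["hug a sad person", "ignore a sad person", "donate to charity"]

def predict_utility(action_sequence: str) -> float:
    # placeholder for the first AI's learned utility estimate
    return random.uniform(-10.0, 10.0)

def random_sequence(length: int = 3) -> str:
    return " then ".join(random.choice(primitive_actions) for _ in range(length))

utility_table = {seq: predict_utility(seq)
                 for seq in (random_sequence() for _ in range(10_000))}
# the plan imagines this table being unimaginably larger than 10,000 entries
```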
we make a second AI, dumber than the first one but still really smart, and plug that utility function (the one the first AI wrote) into it
we turn it on
awesome singularity stuff happens yay we did it
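a minimal sketch of how the second AI might use that inherited utility function: it never learns values itself, it just looks candidate actions up in the first AI's table and picks the highest-rated one (the table entries here are placeholders):

```python
# toy second AI: no value learning of its own, just argmax over the
# utility table written by the first AI
utility_table = {
    "hug a sad person": 6.0,
    "hug a sad person | they didn't consent": -4.0,
    "do nothing": 0.0,
}

def choose_action(candidates, table):
    # pick whichever candidate the inherited utility function rates highest
    return max(candidates, key=lambda a: table.get(a, 0.0))

print(choose_action(list(utility_table), utility_table))  # -> "hug a sad person"
```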
if we’re still scared of it doing something weird, we can additionally tell the second AI to minimize doing actions that don’t affect (the first AI’s perception of) human values at all, so it can’t do something really bad that current humanity can’t comprehend and that the first AI therefore never got humanity’s opinion on
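one possible sketch of that safeguard, under the assumption that "doesn't affect human values at all" shows up as a near-zero utility in the first AI's table; the threshold and penalty numbers are arbitrary:

```python
# toy version of the safeguard: candidate actions the first AI's utility
# function barely registers (utility near zero) get penalized, so the
# second AI prefers actions whose value-relevance was actually assessed
def choose_action_cautiously(candidates, table, neutral_penalty=10.0, eps=0.5):
    def score(action):
        u = table.get(action, 0.0)
        return u - (neutral_penalty if abs(u) < eps else 0.0)
    return max(candidates, key=score)
```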