[Question] why won’t this alignment plan work?

the idea (there’s a toy code sketch at the end of this post):

  • we give the AI a massive list of actions

    • each one is annotated with how much utility we estimate it to have

    • for example, we list “giving someone sad a hug” as having, say, 6 utility, but “giving someone sad a hug | they didn’t consent to it” has −4 utility or something like that

  • we train it to learn human values

    • we give it new actions, and see if it can guess the utility we would assign to those

    • eventually it gets really accurate at that

  • eventually we generate completely random sequences of actions, and have it guess the utility of all of them

  • so it effectively writes out a near-infinite utility function: a giant lookup table of millions of billions of different action sequences, each mapped to its estimated utility

  • we make a second AI, dumber than the first one but still really smart, and plug that utility function (the one the first AI wrote) into it

  • we turn it on

  • awesome singularity stuff happens yay we did it

    if we’re still scared of it doing something weird, we can additionally tell the second AI to minimize actions that don’t register in the first AI’s model of human values at all (i.e. ones it’s completely indifferent about), to stop it from doing something really bad that current humanity can’t even comprehend and that the first AI therefore never got humanity’s opinion on
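
to make the pipeline above concrete, here’s a toy sketch in python of what i have in mind. everything in it is made up for illustration (the example actions, the numbers, the tiny bag-of-words regressor standing in for the first AI), and a real version would obviously involve a huge model rather than sklearn, but it shows the shape of the plan: fit a utility predictor on annotated actions, score a pile of random action sequences into a frozen lookup table, and have a second agent act by consulting that table.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# step 1: a (tiny) list of actions annotated with estimated utility,
# including conditioned variants like "hug | no consent".
# all of these examples and numbers are made up.
annotated = [
    ("give someone sad a hug", 6.0),
    ("give someone sad a hug | they did not consent to it", -4.0),
    ("return a lost wallet to its owner", 5.0),
    ("lie to a friend for personal gain", -6.0),
    ("cook dinner for a tired roommate", 4.0),
    ("ignore someone asking for help", -3.0),
]
texts, utilities = zip(*annotated)

# step 2: train a model to predict the annotated utilities from the action
# descriptions (a stand-in for the first AI "learning human values").
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
utility_model = Ridge().fit(X, utilities)

# step 3: generate random sequences of actions and have the model guess
# their utility, writing out the giant lookup-table utility function.
atoms = sorted({t.split(" | ")[0] for t in texts})
random.seed(0)
utility_table = {}
for _ in range(1000):
    sequence = ", then ".join(random.sample(atoms, k=2))
    score = utility_model.predict(vectorizer.transform([sequence]))[0]
    utility_table[sequence] = float(score)

# step 4: the "second AI" never learns values itself; it just consults the
# frozen table. here it is a trivial policy: pick the highest-utility option.
# impact_penalty is a hypothetical knob for the safety tweak in the last
# bullet: it pushes the agent away from options the table says nothing about
# (utility of roughly zero).
def choose(options, impact_penalty=0.0):
    def adjusted(action):
        u = utility_table.get(action, 0.0)
        return u - impact_penalty if abs(u) < 1e-6 else u
    return max(options, key=adjusted)

some_options = list(utility_table)[:5]
print(choose(some_options))
```

the impact_penalty knob is my hypothetical stand-in for the last bullet above: it nudges the second agent away from options the table is completely indifferent about, which is the closest this toy version gets to “don’t do things the first AI never formed an opinion on”.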