Alignment: “Do what I would have wanted you to do”

Yoshua Bengio writes[1]:

nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans

I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn’t try.

Suppose you have a good ML algorithm (not the stuff we have today that needs 1000x more data than humans), and you train it as an LM.

There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I’ll assume readers can figure out. You give it the goal: “Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do”.
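To make this concrete, here is a minimal sketch of one possible shape of such a prompt wrapper. It is an illustrative assumption, not the author’s construction: the `complete` function stands in for whatever LM completion call you have, and the loop structure and prompt layout are my own placeholders.

```python
# Minimal sketch: turning an LM into a goal-driven agent via prompting alone.
# `complete` is a hypothetical stand-in for an LM completion call;
# nothing here depends on a specific provider or API.

def complete(prompt: str) -> str:
    """Hypothetical LM call: returns the model's continuation of `prompt`."""
    raise NotImplementedError("plug in your LM here")

# The goal is stated once, in plain natural language.
GOAL = (
    "Do what (pre-ASI) X, having considered this carefully for a while, "
    "would have wanted you to do."
)

def agent_step(observation: str, history: list[str]) -> str:
    """One step of the goal-driven loop: show the goal, the history so far,
    and the latest observation, then ask the LM for its next action."""
    prompt = (
        f"Your goal: {GOAL}\n\n"
        + "\n".join(history)
        + f"\nObservation: {observation}\n"
        + "Next action:"
    )
    action = complete(prompt)
    history.append(f"Observation: {observation}\nAction: {action}")
    return action
```

Note that the goal only ever enters as text in the prompt; there is no reward signal anywhere in the loop.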

Whoever builds this AGI will choose what X will be[3]. If it’s a private project with investors, the investors will probably have a say too, as an incentive to invest.

Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn’t say “Do what X wants you to do now”.

Suppose this AI becomes superhuman. Its understanding of language will be superhuman too: the smarter it becomes, the better it will understand the intended meaning of the goal.

Will it turn everyone into paperclips? I don’t think so. That’s not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out.

Will it manipulate its creators into giving it rewards? No. There are no “rewards”.

Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.

Will it resist being turned off? Maybe. Depends on whether it thinks that this is what (pre-ASI) X would have wanted it to do.

  1. ^
  2. ^

    I’m not familiar with the ASI alignment literature, but presumably he is. I googled “would have wanted” + “alignment” on this site, and this didn’t seem to turn up much. If this has already been proposed, please let me know in the comments.

  3. ^

    Personally, I’d probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.