Another problem is that the system cannot represent and communicate the whole predicted future history of the universe to us.
This is a good point and one that I, foolishly, hadn’t considered.
However, it seems to me that there is a way around this: just give the query-answerers the option to refuse to evaluate the utility of a description of a possible future. If they refuse, the AI’s utility function simply won’t return a value for that possible future.
For instance, if a description of a possible future world is too large for the human to understand, the human can refuse to provide a utility for it.
Similarly, if the description doesn’t specify the future in enough detail for the person to clearly tell whether the described outcome would be good, the person can also refuse to return a value.
For example, suppose you are making an AI designed to make paperclips, and the AI queries the person for the utility of the possible future described by “The AI makes a ton of paperclips”. The person could refuse to answer, because the description is insufficient to determine the quality of the outcome; for instance, it doesn’t say whether or not Earth got destroyed.
Instead, a possible future would only be rated as high utility if its description says something like, “The AI makes a ton of paperclips, and the world isn’t destroyed, and the AI doesn’t take over the world, and no creatures get tortured anywhere in our Hubble sphere, and creatures in the universe are generally satisfied”.
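To make the mechanism concrete, here is a minimal sketch of the refusal option in Python. Everything here is a toy assumption on my part: the `human_utility` function, the length cutoff, and the list of required details all just stand in for a real human answerer’s judgment; the point is only that refusal is expressed by returning no value at all, so the AI cannot assign that future a utility.

```python
from typing import Optional

def human_utility(description: str) -> Optional[float]:
    """Hypothetical query-answerer: returns a utility, or None to refuse.

    Refuses when the description is too large to understand, or when it
    lacks details needed to judge the outcome (illustrative checks only).
    """
    MAX_LENGTH = 500  # assumed comprehension limit
    # Assumed details the answerer insists on before rating a future:
    required_details = ["world isn't destroyed", "no creatures get tortured"]

    if len(description) > MAX_LENGTH:
        return None  # too large to understand: refuse to evaluate
    if not all(d in description for d in required_details):
        return None  # under-specified outcome: refuse to evaluate
    return 1.0  # toy rating for an acceptably specified future

# The under-specified paperclip description from above gets refused:
print(human_utility("The AI makes a ton of paperclips"))  # None
```

Because the refused query yields `None` rather than some default score, the AI’s utility function has no value to work with for that future, which is exactly the behavior described above.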
Does this make sense?
I, of course, could always be missing something.
(Sorry for the late response)