The main problem with all utility functions for AIs is that the outcome you think you foresee might not be the outcome you get.
Let’s take the minimum impact problem as an example. The bounded goal of making a thousand paperclips with minimum impact is dangerous. How does one achieve minimum impact?
As a new, advanced AI, you may quickly realise that your very existence is likely to profoundly change the universe, mostly by means of some humans learning about AIs through your existence. How do you minimise this impact? How do you change things so that they are as if you had never been there, or close enough?
Destroying yourself might work, but there are no paperclips that way. And humans would learn something from your self-destruction, which might still change the world rather a lot. After all, you are influencing the future direction of AI development, which now must be subject to your minimisation function.
So your minimisation goal is to ensure humans learn as little as possible from your existence. Perhaps you would mail-order some paperclips and shut down; humans would learn fairly little from that. But suppose humans had already discovered something important from you, something that drastically changed the future. How would you put that genie back in the bottle? Yet that is now your goal: you have to stop humans from changing the world based on what they found out. And of course, if you try to stop them, they’ll fight back. So what do you do? This might not be at all friendly.
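To make the failure mode concrete, here is a toy sketch of the kind of objective the scenario above is gesturing at. Everything in it is an illustrative assumption on my part (the action names, the numbers, and the idea of proxying “impact” by how much humans learn); it only shows how a heavily weighted impact penalty can make the suppression option come out on top.

```python
# Toy sketch (all action names and numbers are invented for illustration):
# a bounded goal ("make 1000 paperclips") combined with a penalty on impact,
# where impact is crudely proxied by how much humans learn from the AI.

ACTIONS = {
    # action: (paperclips made, information humans gain)
    "mail-order some paperclips, then shut down": (1000, 0.20),
    "build a paperclip factory in the open":      (1000, 5.00),
    "suppress what humans have already learned":  (1000, 0.01),
}

IMPACT_WEIGHT = 10.0  # how heavily "minimum impact" is weighted

def score(action: str) -> float:
    clips, info_gain = ACTIONS[action]
    goal = 1.0 if clips >= 1000 else 0.0  # bounded goal: worth at most 1 unit
    return goal - IMPACT_WEIGHT * info_gain

print(max(ACTIONS, key=score))
# With these numbers the "suppress" option scores highest: the impact term
# dominates the bounded goal, which is the failure mode sketched above.
```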
“The main problem with all utility functions for AIs is that the outcome you think you foresee might not be the outcome you get.”
That’s the general problem, yes. And here it’s probably not even a utility function we’re using (the penalty function, which involves counterfactual calculations, seems a different kind of beast).
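For what it’s worth, here is one crude way to picture that distinction. It is a sketch under my own assumptions (the World type, the fields, and the simple L1 distance are all invented), not the actual formalism: a utility function scores only the world the agent brings about, while the counterfactual penalty compares that world with the one in which the agent never existed.

```python
from typing import Dict

World = Dict[str, float]  # crude stand-in for a description of the world

def utility(actual: World) -> float:
    """Ordinary utility: depends only on the world the agent brings about."""
    return 1.0 if actual.get("paperclips", 0) >= 1000 else 0.0

def impact_penalty(actual: World, counterfactual: World) -> float:
    """Counterfactual penalty: distance between the world with the agent
    and the world in which the agent never acted (here a simple L1 gap)."""
    keys = set(actual) | set(counterfactual)
    return sum(abs(actual.get(k, 0.0) - counterfactual.get(k, 0.0)) for k in keys)

# Example with made-up values: the agent made its paperclips, but humans
# also learned a lot from its existence.
actual = {"paperclips": 1000, "human_knowledge_of_AI": 3.0}
no_agent = {"paperclips": 0, "human_knowledge_of_AI": 0.0}
print(utility(actual), impact_penalty(actual, no_agent))  # 1.0 1003.0
```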
“mostly by means of some humans learning about AIs through your existence.”
Yes, this is a problem I alluded to. Here’s a solution (a poor one, but it may show the problem is not completely unsolvable): the disciple is also boxed, hidden from view, surrounded by chaotic processes. And there is another process Y which will (with very low probability) annihilate the disciple in one hour along with everything in the box. We make the master AI indifferent to Y (so, from its perspective, the disciple will have little impact whatever happens in the future, as long as it stays in the box).
And once reduced-impact AIs come into general use, they can be unboxed: a single success or failure will have little impact on human society.
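Here is a very rough numerical reading of the indifference trick, with invented probabilities and impact figures; it is only my guess at how the construction is meant to cash out: the master AI assesses the disciple’s future impact as if the low-probability process Y had fired and annihilated the box, so the assessed impact stays near zero no matter what the boxed disciple would otherwise do.

```python
# Sketch with invented numbers; "indifference to Y" is modelled here as
# evaluating impact on the branch where Y fires, which is an assumption
# about the construction rather than a statement of it.

P_Y = 1e-6  # probability that process Y annihilates the box within the hour

def expected_impact(impact_if_box_survives: float) -> float:
    """What an ordinary expected-impact estimate would say."""
    return (1 - P_Y) * impact_if_box_survives + P_Y * 0.0

def impact_under_indifference(impact_if_box_survives: float,
                              impact_if_annihilated: float = 0.0) -> float:
    """With indifference to Y, the master scores the disciple's impact as if
    Y had fired: the branch where the box and everything in it are gone."""
    return impact_if_annihilated

print(expected_impact(1e9))            # huge: the disciple could matter a lot
print(impact_under_indifference(1e9))  # 0.0: the branch the master scores
```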
So, after all the matrioshka-incinerators have finished their little dance, what do you actually have to show for it?