The way I understand the idea, the point is to make the AI indifferent to being shut down, so that it doesn’t care one way or the other?
As you describe it, one of the problems is indeed that the AI has no incentive to make sure that it (or any agents it creates) retains the property of shutdownability.
Another problem: there seems to be an assumption that once the AI is destroyed, any further reward it gains is zero, so press-the-button vs. not-press-the-button are perfectly balanced. But I don’t think that’s the case for most utility functions: they don’t have to be conditional on the AI’s existence, and the AI doesn’t have to be indifferent to what the world looks like after it’s destroyed. This could maybe be corrected as follows: the button gives the AI a reward equal to its expected discounted future reward, then sets any future reward to zero (including from pressing the button again).
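To make that concrete, here is a minimal sketch in MDP notation (the discount factor $\gamma$, policy $\pi$, and value function $V^\pi$ are my framing, not anything from the post). On pressing the button in state $s_t$, the AI receives

$$r_{\text{press}}(s_t) \;=\; V^\pi(s_t) \;=\; \mathbb{E}_\pi\!\left[\sum_{k=1}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t\right], \qquad r_{t'} = 0 \ \text{ for all } t' > t,$$

where the expectation is taken under the policy the AI would otherwise have followed. Then the expected discounted return from pressing equals the expected return from not pressing, so the two options are balanced, and since all post-press reward is zero, pressing the button a second time gains nothing.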
Great answer, thanks!