Essentially: a button that when you press it kills the AI but instantly gives it reward equal to its expected discounted future reward. And to be clear, this is the AI’s estimate of its expected discounted future reward, not some outside estimate.
(This is in the interest of asking dumb questions and learning in public. And I’m taking the specific narrow definition of corrigibility which is just an AI letting you turn it off.)
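A minimal sketch of the proposed mechanism, assuming a toy setting where the agent's value estimate is just the discounted sum of the rewards it forecasts. The names `discounted_return` and `button_payout` and the numbers are hypothetical, chosen only for illustration:

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def button_payout(value_estimate):
    """Reward paid when the button is pressed.

    Assumption: this is the agent's OWN estimate of its expected
    discounted future reward, not an outside estimate.
    """
    return value_estimate


gamma = 0.9
forecast = [1.0, 1.0, 1.0]  # hypothetical rewards the agent expects if left running

v = discounted_return(forecast, gamma)  # the agent's value estimate
continue_value = v                      # value of not being shut down
shutdown_value = button_payout(v)       # value of being shut down

print(continue_value, shutdown_value)
```

Under these assumptions the two values are equal by construction, which is the whole point of the proposal.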
Thoughts: I’ve heard Dr. Christiano mention things like “precisely balanced incentives” (which I think he uses to describe the weaknesses of this and similar approaches), but I don’t see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:
Such an agent would have no incentive to create corrigible child-agents
Such an agent would have no incentive to preserve this property while self-modifying
But I am probably missing a bunch of other issues with this general approach? Responses in the form of links to relevant papers are welcome as well.
Thanks,
Raf
The way I understand the idea, the point is to make the AI indifferent to being shut down? It doesn’t care one way or the other?
As you describe it, one of the problems is indeed that the AI has no incentive to make sure that it (or other agents it creates) retains the property of shutdownability.
Another problem: there seems to be an assumption that once the AI is destroyed, any further reward it gains is zero, so press-the-button vs. not-press-the-button are perfectly balanced. But I don’t think that’s the case for most utility functions: they don’t have to be conditional on the AI’s existence, and the AI doesn’t have to be indifferent to what the world looks like after it’s destroyed. This could maybe be corrected as follows: the button gives the AI a reward equal to its expected discounted future reward, then sets any future reward gained to zero (including from pressing the button again).
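A rough sketch of why the zeroing step matters, assuming a toy model in which rewards can keep accruing after shutdown. The function `value_of_pressing`, its parameters, and the numbers are all hypothetical:

```python
def value_of_pressing(v_estimate, post_shutdown_rewards, gamma, zero_after_press):
    """Total value the agent assigns to the button being pressed.

    v_estimate: the payout, i.e. the agent's own expected discounted future reward.
    post_shutdown_rewards: rewards the utility function would still register
        after the agent is gone (assumption: they start one step after the press).
    zero_after_press: if True, apply the proposed correction and discard them.
    """
    if zero_after_press:
        return v_estimate
    tail = sum((gamma ** (t + 1)) * r for t, r in enumerate(post_shutdown_rewards))
    return v_estimate + tail


v = 10.0             # hypothetical value of continuing to run
after = [1.0, 1.0]   # hypothetical rewards that still accrue after shutdown

uncorrected = value_of_pressing(v, after, gamma=0.5, zero_after_press=False)
corrected = value_of_pressing(v, after, gamma=0.5, zero_after_press=True)

print(uncorrected, corrected)
```

In this toy case the uncorrected button is worth more than continuing, so the agent is no longer indifferent, while the corrected button matches the value of continuing exactly.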
Great answer, thanks!
The AI will press the button itself?
If implemented as described, the AI should be exactly indifferent to pushing the button? I guess the AI’s behavior in that situation is not well defined… and if we make the button give expected value minus epsilon reward, then the AI might kill you to stop you from pressing the button (because it wants that epsilon reward!)
So overall I suppose this is a fair criticism of the approach and is possibly what Paul means by issues with precisely balancing!
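The balancing issue can be illustrated with a small sketch, assuming an agent that simply compares the value of the button being pressed against the value of continuing. The function `preferred_outcome` and its labels are hypothetical:

```python
def preferred_outcome(v_continue, epsilon):
    """Which outcome a pure reward maximizer prefers.

    The button pays v_continue - epsilon (assumption: epsilon may be
    positive, zero, or negative if the balance is off in either direction).
    """
    v_press = v_continue - epsilon
    if v_press > v_continue:
        return "press"        # button overpays: agent wants it pressed
    if v_press < v_continue:
        return "avoid"        # button underpays: agent wants to prevent the press
    return "indifferent"      # exact tie: behavior is not pinned down


print(preferred_outcome(5.0, 0.0))    # exact balance
print(preferred_outcome(5.0, 0.01))   # slight underpayment
print(preferred_outcome(5.0, -0.01))  # slight overpayment
```

Only the exact tie yields indifference. Any nonzero error in either direction gives the agent a strict preference, either to get the button pressed or to stop it from being pressed.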
You’re assuming that the AI is internally acting to maximize its future reward. I.e., you’re assuming perfect inner alignment to a reward maximization objective. Also, what happens when the AI considers the strategy “make the humans press the button”? What’s its expected future reward for this strategy?
I am new to the AI Alignment field, but at first glance, this seems promising! You can probably hard-code it not to have the ability to turn itself off, if that turns out to be a problem in practice. We’d want to test this in some sort of basic simulation first. The problem would definitely be self-modification and I can imagine the system convincing a human to turn it off in some strange, manipulative, and potentially dangerous way. For instance, the model could begin attacking humans, instantly causing a human to run to shut it down, so the model would leave a net negative impact despite having achieved the same reward.
What I like about this approach is that it is simple/practical to test and implement. If we have some sort of alignment sandbox (using a much more basic AI as a controller or test subject) we can give the AI a way of simply manipulating another agent to press the button, as well as ways of maximizing its alternative reward function.
Upvoted, and I’m really interested to see the other replies here.