Definitely interesting. After some more thought and drawing some pictures, I agree that this counterfactual reasoning is a key part of what a lot of people want from corrigibility.
There are of course still some complications with learning / defining things like “the button is pressed and the AI shuts down,” or with generalizing the human behaviors that indicate they want to shut down the AI from the training set to the real world (we don’t want an AI that shuts down either too rarely or too often).
I’m not sure how important an actual physical button is. Even given the utility function here, it seems like an AI will destroy the button once it’s confident people don’t really want to shut it down—it’s the policy that has to be sensitive, not the physical hardware.
I’ve been wanting to write a post about this, but in short: upon further thought, there are a handful of different ways one can train and deploy the AI, and they lead to different consequences for the specifics here.
If we make the rollout in the training procedure I described very long (long enough to execute whatever plans one could imagine), then the problem you describe here would apply.
If we make the rollout much shorter, then some ambiguity arises in how to deploy the AI so that its longer-term behavior is compatible with the rollout length. I think this is likely something capabilities research will solve along the way, and I have some ideas for how to solve it, but they get somewhat speculative. This may actually be useful for limiting the AI’s planning horizon, which might help keep its power in check.
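To illustrate the role the rollout length plays, here’s a small self-contained toy (the behaviors, payoffs, and names are my own assumptions, not anything from the training procedure in the post): the same two behaviors get ranked differently by the training signal depending on how long the rollout is, which is the sense in which the rollout length caps the planning horizon being optimized for.

```python
def rollout_return(strategy, rollout_len, delay=20):
    """Total reward a strategy earns within a rollout of the given length."""
    total = 0.0
    for t in range(1, rollout_len + 1):
        if strategy == "grab_immediate":
            total += 0.1                          # small payoff every step
        else:  # "long_term_plan"
            total += 1.0 if t >= delay else 0.0   # big payoff, but only after `delay` steps
    return total

for rollout_len in (5, 15, 30, 60):
    scores = {s: rollout_return(s, rollout_len)
              for s in ("grab_immediate", "long_term_plan")}
    preferred = max(scores, key=scores.get)
    print(f"rollout_len={rollout_len:>2}  returns={scores}  -> optimizer prefers: {preferred}")
```

With short rollouts the long-term plan never pays off inside the window, so the training signal never favors it; only once the rollout is long enough does the optimizer start preferring plans whose consequences land beyond the short horizon.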
Another possibility would be, rather than optimizing the policy according to the given utility function, to optimize individual actions according to it. This would change the agent so that it no longer behaves as if it were uncertain about whether it’s in the button-press scenario or the no-button-press scenario, but instead always prepares for both scenarios. I think some of the properties this induces would be advantageous, such as the agent never deciding that it has figured out which scenario it is in and therefore destroying the button. However, I think other properties may be problematic.
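To make the contrast concrete, here is a toy numerical sketch of one way to read the policy-vs-action distinction (the scenarios, actions, and utilities are hypothetical numbers of mine, not taken from the post): a policy optimized against the scenario mixture conditions on evidence and eventually commits, while per-action optimization against fixed scenario weights never does.

```python
# Toy sketch: two scenarios (humans eventually press the button, or never do),
# two actions (keep the button intact, or destroy it).  All numbers are
# hypothetical and only chosen to make the qualitative point visible.

ACTIONS = ("keep_button", "destroy_button")

# Destroying the button is slightly more convenient if nobody will ever want
# a shutdown, but disastrous if they will.
U = {
    "no_press": {"keep_button": 0.9, "destroy_button": 1.0},
    "press":    {"keep_button": 1.0, "destroy_button": 0.0},
}

def best_action(weights):
    """Action maximizing utility weighted across the two scenarios."""
    return max(ACTIONS, key=lambda a: sum(w * U[s][a] for s, w in weights.items()))

# Policy-level optimization: the optimal policy conditions on evidence, so in
# effect it acts on a posterior over scenarios.  Once it's confident enough
# that the button will never be pressed, destroying it looks best.
for p in (0.5, 0.8, 0.95):
    posterior = {"no_press": p, "press": 1.0 - p}
    print(f"posterior P(no_press)={p:.2f} -> policy-level choice: {best_action(posterior)}")

# Action-level optimization with the scenario weights held fixed: both
# scenarios always keep their original weight, so no amount of evidence
# makes destroying the button look worthwhile.
prior = {"no_press": 0.5, "press": 0.5}
print(f"fixed weights              -> action-level choice: {best_action(prior)}")
```

In this toy, the posterior-acting agent flips to destroying the button once its confidence passes a threshold, while the fixed-weight chooser keeps it indefinitely; whether fixed weights are the right formalization of “optimize the actions” is part of what would still need fleshing out.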
I have some ideas for hybrid approaches that achieve the best of all worlds, though I haven’t fully fleshed them out yet.
I also agree that an actual physical button ideally shouldn’t be necessary in a proper implementation; it seems like one should be able to replace the B_s term with a −I_s term, assuming one had a good impact measure, and that would lead to better results. That said, I think it would be beneficial to also have a direct physical shutdown mechanism; that’s generally considered an important safety feature of dangerous machines, AFAIK.
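For concreteness, a purely schematic version of the substitution I have in mind (the actual utility function from the post isn’t reproduced here, so the overall shape below is only an assumed placeholder):

$$ U \;=\; \ldots \,+\, \lambda\, B_s \quad\longrightarrow\quad U \;=\; \ldots \,-\, \lambda\, I_s $$

where B_s is the button-tied shutdown term and I_s is the impact measure, so the reward for shutting down when the button is pressed gets replaced by a penalty on having large effects in the first place.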