Trying to change the human’s preference to match the button is one issue there. The other issue is that if the AI incorrectly estimates the human’s preferences (or, more realistically, we humans building the AI operationalize “our preference re: button state” in such a way that the target the AI actually ends up aimed at doesn’t match what we intuitively mean by that phrase), then that’s really bad.
Another frame: this would basically just be trying to align to human values directly, and has all the usual problems with directly aligning to human values, which is exactly what all this corrigibility-style stuff was meant to avoid.
I agree with the first paragraph, but strongly disagree with the idea that this is “basically just trying to align to human values directly”.
Human values are a moving target in a very high-dimensional space, one that needs many bits to specify. At any given time, the preferred button state needs only one bit, so even a coinflip has a good shot at getting it right. Also, to use your language, I think “human is trying to press the button” is likely to form a much cleaner natural abstraction than human values in general.
Finally, we talk about getting it wrong being really bad. But there’s a strong asymmetry: one direction is potentially catastrophic, while the other is likely to be only a minor nuisance. So if we can bias the agent toward believing the humans probably want to press the button, it becomes even safer.
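To make that asymmetry concrete, here’s a toy calculation (a minimal sketch; the cost constants, the probabilities, and the crude “act on the majority belief” rule are all invented for illustration, not a claim about how a real system would decide):

```python
# Toy illustration of the asymmetry (all numbers are invented).
# Mistake A: the AI resists the button when the humans actually want it pressed -> catastrophic.
# Mistake B: the AI defers to the button when the humans don't actually want it pressed -> nuisance.

COST_WRONGLY_RESIST_PRESS = 1_000_000  # hypothetical "catastrophic" cost
COST_WRONGLY_ALLOW_PRESS = 1           # hypothetical "minor nuisance" cost

def expected_cost(p_humans_want_press: float, bias_toward_press: float) -> float:
    """Expected cost when the AI acts on its (possibly biased) belief about the button."""
    believed_p_press = min(1.0, p_humans_want_press + bias_toward_press)
    if believed_p_press >= 0.5:
        # The AI defers to the button; only the nuisance-type mistake remains possible.
        return (1 - p_humans_want_press) * COST_WRONGLY_ALLOW_PRESS
    # The AI resists the button; the catastrophic mistake remains possible.
    return p_humans_want_press * COST_WRONGLY_RESIST_PRESS

for bias in (0.0, 0.3):
    print(bias, expected_cost(p_humans_want_press=0.4, bias_toward_press=bias))
# With no bias the AI resists the press and carries the catastrophic risk (~400000);
# with a modest bias toward "press", the worst case is only the nuisance cost (~0.6).
```

The point is just that with costs this lopsided, erring toward deference is cheap insurance.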
Here’s a problem that I think remains. Suppose you’ve got an agent that prefers to have the button in the state that it believes matches my preferences. Call these ‘button-matching preferences.’ If the agent only has these preferences, it isn’t of much use. You have to give the agent other preferences to make it do useful work. And many patterns for these other preferences give the agent incentives to prevent the pressing of the button. For example, suppose the other preferences are: ‘I prefer lottery X to lottery Y iff lottery X gives a greater expectation of discovered facts than lottery Y.’ An agent with these preferences would be useful (it could discover facts for us), but it also has incentives to prevent shutdown: it can discover more facts if it remains operational. And it seems difficult to ensure that the agent’s button-matching preferences will always win out over these incentives.
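To make the worry concrete, here’s a toy version of that agent (a minimal sketch; the additive form of the utility, the weights, and the fact counts are all assumptions made up for illustration):

```python
# Toy model of an agent with button-matching preferences plus a fact-discovery drive.
# The additive utility, the weights, and the numbers are all invented for illustration.

W_BUTTON = 1.0  # weight on the button matching the human's preference
W_FACTS = 1.0   # weight on expected facts discovered

def utility(button_matches_human_pref: bool, expected_facts: float) -> float:
    return W_BUTTON * float(button_matches_human_pref) + W_FACTS * expected_facts

# Suppose the human wants to press the button (i.e. wants the agent shut down).

# Option A: allow the press. The button matches the human's preference,
# but a shut-down agent discovers no further facts.
allow_shutdown = utility(button_matches_human_pref=True, expected_facts=0.0)

# Option B: prevent the press. The button no longer matches the human's preference,
# but the agent keeps running and expects to discover many more facts.
prevent_shutdown = utility(button_matches_human_pref=False, expected_facts=100.0)

print(allow_shutdown, prevent_shutdown)  # 1.0 vs 100.0: the agent prefers to prevent shutdown
# Unless W_BUTTON is somehow made to dominate every achievable fact-discovery payoff,
# the button-matching term loses.
```

This is exactly the difficulty stated above: guaranteeing that the button-matching preferences always win means bounding the other term, which is hard for an open-ended objective like fact discovery.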
In case you’re interested, I discuss something similar here and especially in section 8.2.