Here’s a problem that I think remains. Suppose you’ve got an agent that prefers to have the button in the state that it believes matches my preferences. Call these ‘button-matching preferences.’ If the agent only has these preferences, it isn’t of much use. You have to give the agent other preferences to make it do useful work. And many patterns for these other preferences give the agent incentives to prevent the pressing of the button. For example, suppose the other preferences are: ‘I prefer lottery X to lottery Y iff lottery X gives a greater expectation of discovered facts than lottery Y.’ An agent with these preferences would be useful (it could discover facts for us), but it also has incentives to prevent shutdown: it can discover more facts if it remains operational. And it seems difficult to ensure that the agent’s button-matching preferences will always win out over these incentives.
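To make that incentive concrete, here is a minimal sketch in Python with made-up numbers (the lottery values, and names like `expected_facts`, `allow_press`, and `prevent_press`, are all my own illustrative assumptions, not anything from a real agent design). It just shows that a preference ranking lotteries by expected discovered facts will, on its own, rank 'prevent the press' above 'allow the press' whenever staying operational raises that expectation.

```python
# Minimal sketch of the incentive, with made-up numbers.
# A 'lottery' here is just a list of (probability, discovered_facts) outcomes.

def expected_facts(lottery):
    """Expectation of discovered facts under a lottery: sum of p * facts."""
    return sum(p * facts for p, facts in lottery)

# Hypothetical numbers. If the agent allows the button press, there is some
# chance it gets shut down early and discovers few facts.
allow_press = [(0.5, 10),    # button pressed, shut down early: few facts
               (0.5, 100)]   # button not pressed: keeps discovering facts

# If the agent prevents the press, it stays operational and discovers more.
prevent_press = [(1.0, 100)]

# The fact-discovery preference: prefer X to Y iff E[facts | X] > E[facts | Y].
print(expected_facts(allow_press))     # 55.0
print(expected_facts(prevent_press))   # 100.0
print(expected_facts(prevent_press) > expected_facts(allow_press))  # True
```

So on these (assumed) numbers the fact-discovery preferences favour prevention by a margin of 45 expected facts, and the button-matching preferences would have to outweigh that margin in every such case for the agent to stay shutdownable.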
In case you’re interested, I discuss something similar here and especially in section 8.2.