You can estimate the likely loss and likely gain from that utility change, as with anything.
You can try. Your estimate is likely to be very diffuse and uncertain—the issue is that you are trying to get a handle on the distribution tail and that is quite hard to do (see Taleb’s black swans, etc.)
As long as you're reasonably certain that the bottom parts of the utility function are more likely to be reached through extortion than through other means, this is a rational thing to do.
Not at all—you're forgetting about the magnitude of consequences.
Let’s say you have a blackmailer who wants a pony and she has the capability to meddle with your AI’s sensors. Lo and behold, she walks up to the AI and says “I want a pony! Look, there is a large incoming asteroid on a collision course with Earth. Gimme a pony and I’ll tell you if it’s real”.
Ah, says you the designer. I estimate that the blackmailer is bluffing in 99% of the cases. That “bottom part of the utility function” (aka The Sweet Meteor Of Death) is much more likely to be accessed through extortion, a hundred times more likely, in fact.
Therefore I will instruct the AI to disregard any data that tells it there is an incoming asteroid on a collision course. And voilà—the blackmailer doesn't get a pony.
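The flaw in the designer's reasoning can be made concrete with back-of-the-envelope expected values. The numbers below are hypothetical, chosen only to illustrate the point: even when 99% of threats are bluffs, the expected cost of ignoring all asteroid warnings dwarfs the cost of handing over ponies.

```python
# Hypothetical utilities (arbitrary units) for the pony/asteroid example.
P_BLUFF = 0.99      # designer's estimate that any given threat is a bluff
U_PONY = -1.0       # cost of giving up a pony
U_IMPACT = -1e9     # cost of an unmitigated asteroid strike

# Policy A: always pay the pony, so real warnings are never ignored.
eu_pay = U_PONY

# Policy B: disregard all asteroid warnings. No ponies are ever paid,
# but the 1% of real asteroids go unaddressed.
eu_ignore = (1 - P_BLUFF) * U_IMPACT

# eu_pay (-1.0) is vastly better than eu_ignore (roughly -1e7):
# the rarity of real threats does not compensate for their magnitude.
print(eu_pay, eu_ignore)
```

The asymmetry is the whole point: multiplying a small probability by a catastrophic magnitude still leaves a catastrophic expectation.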
What could possibly go wrong?
The sweet meteor of death is well above the z point. Complete human extinction is above the z point.
This hack is not intended to deal with normal extortion, it’s intended to cut off really bad outcomes.
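One way to read the "z point" hack (this is my interpretation, with hypothetical names and numbers) is as clamping the utility function at a floor z: every outcome below z is treated identically, so an extortionist gains nothing by escalating the threatened magnitude below that floor.

```python
# Hypothetical floor: only outcomes worse than this are cut off.
Z_POINT = -1_000.0

def capped_utility(u: float) -> float:
    """Clamp raw utility at the z point: everything below z looks the same."""
    return max(u, Z_POINT)

# A threat of -1e6 and a threat of -1e9 are indistinguishable after
# clamping, so escalating the threat buys the extortionist nothing,
# while ordinary outcomes above z are evaluated normally.
print(capped_utility(-1e6) == capped_utility(-1e9))  # True
print(capped_utility(-5.0))                          # -5.0
```

Under this reading, normal extortion (threats above z) is handled by ordinary decision theory; the clamp only neutralizes leverage from outcomes past the cutoff.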
What would these be? Can you give a couple of examples?
Are you basically trying to escape Pascal’s Mugging?
The extortion version of that, yes.