‘Utility Indifference’ (2010) by FHI researcher Stuart Armstrong
I just noticed that LessWrong has not yet linked to FHI researcher Stuart Armstrong’s brief technical report, Utility Indifference (2010). It opens:
Consider an AI that follows its own motivations. We’re not entirely sure what its motivations are, but we would prefer that the AI cooperate with humanity; or, failing that, that we can destroy it before it defects. We’ll have someone sitting in a room, their finger on a detonator, ready at the slightest hint of defection.
Unfortunately as has been noted… this does not preclude the AI from misbehaving. It just means that the AI must act to take control of the explosives, the detonators or the human who will press the button. For a superlatively intelligent AI, this would represent merely a slight extra difficulty. But now imagine that the AI was somehow indifferent to the explosives going off or not (but that nothing else was changed). Then if ever the AI does decide to defect, it will most likely do so without taking control of the explosives, as that would be easier than otherwise. By “easier” we mean that the chances of failure are less, since the plan is simpler… recall that under these assumptions, the AI counts getting blown up as an equal value to successfully defecting. How could we accomplish this indifference?
This article jumps right into probabilistic explanation without actually putting into English what it’s trying to achieve. It seems like the idea is to set its function such that (say) it is indifferent to whether or not it gets blown up, regardless of the other consequences. This ensures that it’s always possible to blow it up, as it never cares about that, and will completely ignore you if you try—even if it becomes aware that you’re blowing it up, it won’t care about it, and won’t stop you.
The problem is how to frame its reward function. Something like—if you’re blown up, your reward for being blown up is precisely your current expected reward that you think you would have got if you’d carried on as you were.
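One way to write that description down, as a sketch rather than the report's own formalism: the symbols D (the set of worlds where the detonator fires), U (the original utility) and U' (the adjusted utility) are notation assumed here, and every expectation is taken under one fixed plan of the AI.

```latex
% Assumed notation: D = worlds where the detonator fires,
% U = original utility, U' = utility after the indifference adjustment.
U'(w) =
\begin{cases}
  U(w) & \text{if } w \notin D \\
  \mathbb{E}[U \mid \neg D] & \text{if } w \in D
\end{cases}
\qquad\Longrightarrow\qquad
\mathbb{E}[U'] = P(\neg D)\,\mathbb{E}[U \mid \neg D] + P(D)\,\mathbb{E}[U \mid \neg D]
              = \mathbb{E}[U \mid \neg D]
```

Under this reading P(D) drops out of the plan's score, so the AI gains nothing by lowering or raising the chance of detonation.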
The problem of this indifference is that blowing yourself up is almost certainly an easier problem than what we wanted the AI to do—it has to be, otherwise we wouldn’t be able to figure out how to blow up the AI. The AI ought to conclude that it might as well blow itself up because it doesn’t make any difference to do so. There’s no reason why it should do this, but equally, no reason why not.....
It’s more subtle than that: the AI will assume that every attempt to blow it up (in that specific way) will fail. Or rather, it will behave as if it believed that the attempt will fail. This is because the utility of “successful blow up” is reset to that of “unsuccessful blow up”, rather than the other way round. So it’s equally happy to be blown up now, as it was before under the assumption it wouldn’t be blown up.
Since it’s behaving as if it believed the blow up would fail, it will only trigger the blow-up deliberately if there is another good reason for it to do so.
I thought that was the way round I had it—the utility of a successful blowup is set equal to the expected utility of an unsuccessful blowup. I didn’t think there were any false beliefs involved—it’s simply a matter of the utility function, not a matter of whether it thinks it will be blown up or not. I thought you meant that the AI would know full well you were trying to blow it up, but wouldn’t mind because it was going to get the exact same utility for that state as for carrying on.
Yes, the AI knows full well, and won’t mind. What I mean by “act as if it didn’t believe it would get blown up” is that before we adjust its utility, it has a particular behaviour B that it would follow if it believed that the detonator would never trigger. Then after we adjust its utility to make it indifferent, it will follow B.
In terms of behaviour, this utility adjustment has the same effect as if we convinced it that the detonator could never trigger—but without it having false beliefs.
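A minimal Python sketch of that claim, under assumptions that are not in the report: the three plans, their probabilities and their utilities are invented toy numbers, and `Plan`, `eu_original`, `eu_indifferent` and `eu_false_belief` are hypothetical names. It compares the plan chosen under the original utility, under the adjusted utility with correct beliefs, and under the original utility with the false belief that the detonator never triggers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    p_detonation: float   # true chance the AI gets blown up under this plan
    u_if_survives: float  # expected utility given that no detonation happens


U_DETONATED = 0.0  # original (unadjusted) utility of being blown up; toy value

# Toy numbers, chosen only for illustration.
PLANS = [
    Plan("cooperate", 0.0, 5.0),
    Plan("defect, seize detonator first", 0.05, 8.0),
    Plan("defect, ignore detonator", 0.9, 9.0),
]


def eu_original(plan: Plan) -> float:
    """Original utility, correct beliefs about detonation."""
    return (plan.p_detonation * U_DETONATED
            + (1 - plan.p_detonation) * plan.u_if_survives)


def eu_indifferent(plan: Plan) -> float:
    """Adjusted utility, correct beliefs: the detonation branch is scored
    at the expected utility of the no-detonation branch."""
    return (plan.p_detonation * plan.u_if_survives
            + (1 - plan.p_detonation) * plan.u_if_survives)


def eu_false_belief(plan: Plan) -> float:
    """Original utility, but acting on the false belief p_detonation = 0."""
    return plan.u_if_survives


def best(plans, score) -> str:
    return max(plans, key=score).name


choice_original = best(PLANS, eu_original)
choice_indifferent = best(PLANS, eu_indifferent)
choice_false_belief = best(PLANS, eu_false_belief)  # this is behaviour B
```

With these numbers the unadjusted AI seizes the detonator first, while the indifferent AI picks the same plan as the AI with the false belief, even though `eu_indifferent` uses the true detonation probabilities throughout.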
In computer science terms, this is going to result in an untidy shutdown of the machine. If the AI is actually doing something potentially dangerous at the time, then this algorithm will terminate the AI in the middle of doing it. It may even decide it’s entirely appropriate to start flying aircraft or operating nuclear plants after it already knows you’re going to blow it up.
Still better than letting it take over.
No doubt about that....
So for the branches where it gets blown up, it instead computes expected utility for the counterfactual where the explosives are duds. I think the hard part would be getting it to extend the disabling mechanism to the successors and siblings it builds. Also, the mechanism might be dangerous in itself. After all, it’s almost certainly going to create not just additional datacenters, but also extend pieces of its intelligence into everyone’s cell phones, cars, etc. Then you have to choose between letting minor defections slide, and disabling a bunch of probably-vital-to-society technology all at once. And since it computes expected utility on the assumption that the disabling mechanism is a dud, you can’t let it do things like actively control unstable nuclear reactors, or the button becomes too dangerous to push.
Since it is indifferent to being blown up, it should build its successors in the same way—why would it want its siblings to care about something it doesn’t?
Yep. This is nothing like a complete solution, and will most likely be used in other, more subtle ways (like making an Oracle AI indifferent to the consequences of its answers), rather than with this explosive example.
Some “ff” ligatures appear to have been lost in translation (indifferent --> indierent, off --> o).
Fixed.
If it has any trouble at all achieving its primary goal, won’t it just be easier to taunt you into blowing it up? It’s indifferent between the two, after all. I mean, this may well be a way of creating a safe AI, create an AI that kills itself if it tries to do anything interesting.
I certainly agree with the conclusion of the note:
‘Indifference’, remember, is just ‘U(S) = U(T)’. That is, a successful defection equals the value of the explosion going off? This is really problematic depending on how we cash out ‘defection’. For example, suppose defection involves taking over everything on Earth including the detonator; then defection looks better than taunting, since there are two avenues to ‘U(S) = U(T)’:
the AI takes over the detonator and pushes it
or it wins in whatever the larger sense is
Another thought: I can’t quite follow all the formalism, but the key to indifference seems to be defining a set of worlds where the detonator set up remains unchanged. Doesn’t this encourage the AI to seize control of the detonator to make sure it does not change (ie. detonate)? (If we expand the set of detonator-worlds to include worlds where it has gone off and killed the AI, that might fix this issue, but adding in an explicit high valuing of successful suicide seems like it would not be very useful...)
The AI will never have any active desire to push the detonator (barring other reasons like someone saying “push the detonator, and I will donate $1 to the AI”).
And this will not stop the AI from defecting, not at all. It will, however, ensure that while defecting, the detonator will not be a priority: it’s effectively just an inert lump from the AI’s perspective. So the AI will try and grab the nuclear missiles, or hack the president, or whatever, but the guy in the shed by the explosives is low down on the list. Maybe low down enough that they’ll be able to react on time.
To reiterate what I said, if defection is of positive value to the AI and pushing the detonator is == defection as it seems to be, then pushing the detonator is of equal positive value.
The detonator going off is of equal value to the detonator not going off. But pushing the button is of no value to the AI.
You’re right that this eliminates the downside to the button being pushed, from the AI’s perspective, so the AI is more likely to indulge in behaviours that result in the button being pushed (such as defecting and not gaining control of the explosive setup as a priority). Thus it eliminates the potential of the explosives as a threat to keep it well behaved; it keeps them solely as a security for the human race.
If you want to preserve the threat aspect, the simplest would be to add a second batch of independent explosives, to whose detonation the AI is not indifferent, with the two batches being as separate and as unconnected as possible. Then you would get both advantages in one.
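The two-batch idea can be sketched in Python as follows. This is an illustrative toy model only: the utilities, the probabilities and the names `eu_defect` and `choice` are assumptions made for the example, and the two batches are treated as fully independent.

```python
# Toy numbers, chosen only for illustration.
U_COOPERATE = 5.0        # utility of behaving
U_DEFECT_SUCCESS = 10.0  # utility of a successful defection
U_BLOWN_UP = 0.0         # utility of dying to the batch the AI does care about


def eu_defect(p_indifferent_batch: float, p_threat_batch: float) -> float:
    """Expected utility of defecting, as the AI itself scores it.

    The indifferent batch is scored as if it never fired, so its probability
    drops out of the calculation. The threat batch is scored normally.
    """
    del p_indifferent_batch  # deliberately unused: indifference removes it
    return (p_threat_batch * U_BLOWN_UP
            + (1 - p_threat_batch) * U_DEFECT_SUCCESS)


def choice(p_indifferent_batch: float, p_threat_batch: float) -> str:
    defect = eu_defect(p_indifferent_batch, p_threat_batch)
    return "defect" if defect > U_COOPERATE else "cooperate"


# Only the indifferent batch exists: no deterrence, however likely it is to fire.
only_indifferent = choice(p_indifferent_batch=0.9, p_threat_batch=0.0)

# Adding a second batch the AI cares about restores the deterrent.
both_batches = choice(p_indifferent_batch=0.9, p_threat_batch=0.8)
```

In this sketch the first batch can still fire with probability 0.9 whatever the AI decides, which is the security role; only the second batch changes the AI's decision, which is the threat role.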
The link in the post no longer works. Here’s one that works.
Thanks for posting this on less wrong; I should have done so myself.
Is this from the school of “If you don’t make it hungry, it will starve to death even if it has goals and knows it needs to eat to live and live to accomplish the goals”?
Is there a name for that fallacy?
No, it’s from the school of “If it starves to death while otherwise trying to accomplish its goals, it will count this as a success, just as much as if it actually had accomplished its goals. So it won’t bother to eat”.
Sounds like this idea from a while back.
This method achieves its goal by adjusting the utility function, not by making the AI have false beliefs.
Yes. I realize they are not identical methods.
I’ve seen this mentioned elsewhere.
This is equivalent to the AI assigning getting killed a zero probability.
One problem is that the AI might decide to give everyone a button to kill it, so that they feel better about the AI. It will assume that nobody will press the button (or not care if they do), but someone is bound to.
Make it indifferent to a specific person or group blowing it up, not to anyone doing so.
Then you have to somehow define a specific person (you don’t want it to not count the future iterations of them) and blowing it up (you don’t want it to figure that they blew it up in the last case, on the basis of the butterfly effect). Both of these ideas are easy for a person to understand, but don’t seem to mean a whole lot on deeper levels.
No, I don’t need that. I make the AI indifferent to a particular quantum event. That particular quantum event is a detonating event in a particular setup with explosives. The AI will act as if it believed that that quantum event will never result in the explosives going off.
The AI is hence not directly indifferent to that specific person and explosives, but only as a consequence of acting as if it believed that the detonator would never work.
In that case, it may decide to move its mainframe or the explosives. It seems highly likely that it will upload itself onto the internet to take advantage of all of the computing power.