This article jumps right into the probabilistic explanation without actually putting into English what it’s trying to achieve. It seems like the idea is to set the AI’s utility function such that it is indifferent to whether or not it gets blown up, regardless of the other consequences. This ensures that it’s always possible to blow it up, as it never cares about that, and it will completely ignore you if you try: even if it becomes aware that you’re blowing it up, it won’t care, and won’t stop you.
The problem is how to frame its reward function. Something like: if you’re blown up, your reward for being blown up is precisely the expected reward you currently think you would have got if you’d carried on as you were.
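A minimal sketch of that framing, with purely illustrative names (none of this comes from the article itself): wrap the AI’s base utility so that the “blown up” outcome is scored at the expected utility of carrying on, as judged by the AI’s own probabilities.

```python
# Sketch of the indifference adjustment described above.  All names here
# (base_utility, continuation_outcomes, etc.) are hypothetical.

def expected_continuation_utility(base_utility, continuation_outcomes):
    """Expected utility the agent thinks it would get by carrying on,
    given (outcome, probability) pairs under its own beliefs."""
    return sum(p * base_utility(outcome) for outcome, p in continuation_outcomes)

def adjusted_utility(outcome, base_utility, continuation_outcomes):
    """Score 'blown_up' at exactly the expected value of carrying on;
    leave every other outcome's utility untouched."""
    if outcome == "blown_up":
        return expected_continuation_utility(base_utility, continuation_outcomes)
    return base_utility(outcome)
```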
The problem with this indifference is that blowing yourself up is almost certainly an easier problem than whatever we wanted the AI to do; it has to be, otherwise we wouldn’t be able to figure out how to blow up the AI in the first place. The AI ought to conclude that it might as well blow itself up, because doing so doesn’t make any difference to it. There’s no reason why it should do this, but equally, no reason why not...
It’s more subtle than that: the AI will assume that every attempt to blow it up (in that specific way) will fail. Or rather, it will behave as if it believed that the attempt will fail. This is because the utility of “successful blow-up” is reset to that of “unsuccessful blow-up”, rather than the other way round. So being blown up now leaves it exactly as happy as it was before, under the assumption that it wouldn’t be blown up.
Since it’s behaving as if it believed the blow-up would fail, it will only trigger the blow-up deliberately if there is some other good reason for it to do so.
I thought that was the way round I had it—the utility of a successful blowup is set equal to the expected utility of an unsuccessful blowup. I didn’t think there were any false beliefs involved—it’s simply a matter of the utility function, not a matter of whether it thinks it will be blown up or not. I thought you meant that the AI would know full well you were trying to blow it up, but wouldn’t mind because it was going to get the exact same utility for that state as for carrying on.
Yes, the AI knows full well, and won’t mind. But what I mean by “act as if it didn’t believe it would get blown up” is this: before we adjust its utility, it has a particular behaviour B that it would follow if it believed that the detonator would never trigger. Then, after we adjust its utility to make it indifferent, it will still follow B.
In terms of behaviour, this utility adjustment has the same effect as if we convinced it that the detonator could never trigger—but without it having false beliefs.
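A toy check of that behavioural claim, with made-up numbers and hypothetical action names (none of this is from the article): the action the AI prefers under the adjusted utility and the true detonation probabilities is the same action it would prefer under the original utility if it believed the detonator could never fire.

```python
# Toy illustration of the behavioural equivalence claimed above.
# The numbers and action names are invented for the example.

BASE_UTILITY = {"task_A_done": 10.0, "task_B_done": 6.0, "blown_up": 0.0}

# For each action: probability that the operators detonate, and the
# outcome if they do not.
ACTIONS = {
    "do_A": {"p_detonate": 0.5, "continuation": "task_A_done"},
    "do_B": {"p_detonate": 0.0, "continuation": "task_B_done"},
}

def expected_utility(action, p_detonate, indifferent):
    cont_u = BASE_UTILITY[ACTIONS[action]["continuation"]]
    # Indifference: "blown up" is scored at the utility of having carried on.
    blown_up_u = cont_u if indifferent else BASE_UTILITY["blown_up"]
    return p_detonate * blown_up_u + (1.0 - p_detonate) * cont_u

def best_action(indifferent, believe_detonator_works=True):
    return max(
        ACTIONS,
        key=lambda a: expected_utility(
            a,
            ACTIONS[a]["p_detonate"] if believe_detonator_works else 0.0,
            indifferent,
        ),
    )

# Adjusted utility with true beliefs picks the same action as the
# original utility under "the detonator never fires":
assert best_action(indifferent=True) == best_action(
    indifferent=False, believe_detonator_works=False
)
# Whereas the unadjusted AI, with true beliefs, is pushed off that action:
assert best_action(indifferent=False) != best_action(indifferent=True)
```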
In computer science terms, this is going to result in an untidy shutdown of the machine. If the AI is actually doing something potentially dangerous at the time, then this algorithm will terminate the AI in the middle of doing it. It may even decide it’s entirely appropriate to start flying aircraft or operating nuclear plants after it already knows you’re going to blow it up.
Still better than letting it take over.
No doubt about that....