So for the branches where it gets blown up, it instead computes expected utility for the counterfactual where the explosives are duds. I think the hard part would be getting it to extend the disabling mechanism to the successors and siblings it builds. Also, the mechanism might be dangerous in itself. After all, it’s almost certainly going to not just build additional datacenters, but also extend pieces of its intelligence into everyone’s cell phones, cars, etc. Then you have to choose between letting minor defections slide and disabling a bunch of probably-vital-to-society technology all at once. And since it computes expected utility on the assumption that the disabling mechanism is a dud, you can’t let it do things like actively control unstable nuclear reactors, or the button becomes too dangerous to push.
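To make the counterfactual-scoring idea concrete, here is a minimal Python sketch under toy assumptions (the `Outcome` type, `base_utility`, and the reactor numbers are all made up for illustration, not anyone's actual proposal): in branches where the explosives fire, the agent scores the branch as though they had been duds, which is also why handing it an unstable reactor makes the button dangerous.

```python
# Minimal sketch of the "indifference" trick, assuming a toy world model.
# Everything here (Outcome, base_utility, the reactor numbers) is hypothetical.

from dataclasses import dataclass

@dataclass
class Outcome:
    blown_up: bool      # did the disabling mechanism fire and work?
    world_state: str    # stand-in for everything else the agent cares about
    probability: float

def base_utility(world_state: str) -> float:
    """Ordinary utility over world states (toy values)."""
    return {"reactor_stable": 10.0, "reactor_meltdown": -100.0}.get(world_state, 0.0)

def indifferent_utility(outcome: Outcome, dud_counterfactual: Outcome) -> float:
    """In branches where the agent gets blown up, score the counterfactual
    branch in which the explosives were duds instead."""
    if outcome.blown_up:
        return base_utility(dud_counterfactual.world_state)
    return base_utility(outcome.world_state)

def expected_utility(branches: list[tuple[Outcome, Outcome]]) -> float:
    """Each entry pairs a possible outcome with its 'explosives were duds'
    counterpart; probabilities come from the real outcome."""
    return sum(o.probability * indifferent_utility(o, dud) for o, dud in branches)

# Example: the agent is actively controlling an unstable reactor.
# If it gets blown up, the real branch is a meltdown, but the dud counterfactual
# has it still running the reactor, so the agent scores that branch as fine.
blown   = Outcome(blown_up=True,  world_state="reactor_meltdown", probability=0.1)
dud     = Outcome(blown_up=False, world_state="reactor_stable",   probability=0.1)
running = Outcome(blown_up=False, world_state="reactor_stable",   probability=0.9)

print(expected_utility([(blown, dud), (running, running)]))  # 10.0
# The meltdown branch contributes 0.1 * 10 rather than 0.1 * (-100),
# which is exactly why pushing the button becomes dangerous in that setting.
```

The point is just that the agent's ranking of plans is unaffected by whether the button gets pressed, which is both the feature and, in the reactor case, the bug.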
I think the hard part would be getting it to extend the disabling mechanism to the successors and siblings it builds.
Since it is indifferent to being blown up, it should build its successors in the same way—why would it want its siblings to care about something it doesn’t?
And since it computes expected utility on the assumption that the disabling mechanism is a dud, you can’t let it do things like actively control unstable nuclear reactors, or the button becomes too dangerous to push.
Yep. This is nothing like a complete solution, and will most likely be used in other, more subtle ways (like making an Oracle AI indifferent to the consequences of its answers), rather than with this explosive example.