The problem with weird tricks like this is that there is an endless list of technicalities that could break them. For example, suppose the AI decides that it wants to wipe out every human except one. Then it won’t trigger the fuse; it’ll come up with another strategy. Any other hole the AI finds in the fake implementation details of the self-destruct mechanism would have the same effect. It might also notice the incendiaries inside its brain and remove them, build a copy of itself without the corresponding mechanism, and so on.
On the other hand, there is some value to self-destruct safeguards if they take the form of code in the AI’s mind that uses state inspection (telepathy) to detect lethal intent directly. But that can only partially reduce the risk, not eliminate it entirely.
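To make the distinction concrete, here is a minimal sketch in Python of what a state-inspection safeguard might look like. Every name in it (AgentState, planned_outcomes, lethal_intent) is hypothetical, and it assumes, implausibly, that the agent’s predicted outcomes are legible and honest; the point is only to show where the check sits relative to behavior-based tripwires.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    # Hypothetical stand-in for the agent's internal world model:
    # the predicted consequences of its currently selected plan.
    planned_outcomes: dict[str, float] = field(default_factory=dict)

def lethal_intent(state: AgentState) -> bool:
    # "Telepathy": read the plan's predicted outcomes directly from
    # internal state, rather than waiting to observe behavior.
    return state.planned_outcomes.get("human_fatalities", 0.0) > 0.0

def step(state: AgentState, execute_plan, self_destruct) -> None:
    # The safeguard runs inside the agent's own decision loop,
    # before any plan reaches the outside world.
    if lethal_intent(state):
        self_destruct()
    else:
        execute_plan(state)

# Illustrative usage: a plan predicting any fatalities trips the fuse.
state = AgentState(planned_outcomes={"human_fatalities": 1.0})
step(state, execute_plan=lambda s: print("executing"),
     self_destruct=lambda: print("self-destruct triggered"))
```

The sketch also shows where the residual risk lives: the check is only as good as the legibility of planned_outcomes, and an agent that models the check can arrange for that field to look clean, or build a successor that lacks the check entirely.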