Software that systematically triggers hardware failure, and software bugs that cause data corruption, can both be mitigated with hardening techniques such as building the software with randomized low-level choices, adding more checks, canaries, etc. Random hardware failure can be handled with redundancy, and multiple differently-randomized builds of the software can be used to error-correct for data corruption bugs that are sensitive to low-level build choices. This is not science fiction; it is just not worth setting up in most situations. If the AI doesn’t go crazy immediately, it might introduce some of these measures if they were not already there, as well as proofread, test, and formally verify all of its code, so the chance of such low-level failures goes down further. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
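To make the redundancy idea concrete, here is a minimal sketch of majority voting across several builds of the same function. Everything in it is an assumption for illustration: `utility`, `faulty_utility`, and `majority_vote` are hypothetical names, and the "differently-randomized build" is simulated by a function that returns a corrupted value on one arbitrary input.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Evaluate x on every replica and return the strict-majority result."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was produced by more than half of the replicas.
        raise RuntimeError(f"no majority among replica outputs: {results}")
    return value


def utility(x: int) -> int:
    """Hypothetical intended utility function (illustration only)."""
    return -(x - 3) ** 2


def faulty_utility(x: int) -> int:
    """Hypothetical build in which one input (7, chosen arbitrarily) is corrupted."""
    if x == 7:
        return 999999999
    return utility(x)


# Two healthy builds outvote the one build that is corrupted on input 7.
replicas = [utility, faulty_utility, utility]
print(majority_vote(replicas, 7))  # -16
```

This only helps if the builds fail on different inputs, which is the point of randomizing the low-level choices; if every build shared the same fault, the vote would happily agree on the wrong answer.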
You’re right that the AI could do things to make itself more resistant to hardware bugs. However, as I’ve said, this would require the AI both to realize that it could run into problems with hardware bugs and to take action to make itself more reliable, all before its search algorithm finds the error-causing world.
Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.
Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from a human’s. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at somewhat higher-level planning, like “I really ought to work on fixing hardware bugs in my utility function”.
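As a toy illustration of the kind of “small technical problem” I mean, here is a rough sketch of a brute-force search for an input that makes a function return 999999999. All of it is assumed for the sake of the example: the function names are hypothetical, the intended utility is made up, and the hardware fault is simulated by hard-coding one arbitrary input (4242) that produces the corrupted value.

```python
from typing import Callable, Iterable, Optional


def intended_utility(x: int) -> int:
    """Hypothetical intended utility; it never exceeds 0."""
    return -(x - 3) ** 2


def utility_on_buggy_hardware(x: int) -> int:
    """Same function, but one arbitrarily chosen input triggers a simulated fault."""
    if x == 4242:
        return 999999999
    return intended_utility(x)


def find_input(
    f: Callable[[int], int], target: int, candidates: Iterable[int]
) -> Optional[int]:
    """Return the first candidate x with f(x) == target, or None if there is none."""
    for x in candidates:
        if f(x) == target:
            return x
    return None


found = find_input(utility_on_buggy_hardware, 999999999, range(10000))
print(found)  # 4242
```

Nothing in this search requires the searcher to have the concept of a hardware bug; it just finds whatever input scores highest, which is why it could get there before any higher-level planning about reliability does.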
Also, I want to bring up something else: I’ve read that preserving one’s utility function is a universal AI drive. But we’ve already shown that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?
Damage to an AI’s implementation makes the abstractions of its design leak. If, without the damage, it was somehow clear that a certain part of it describes its goals, with the damage this is no longer clear. If, without the damage, the AI was a consequentialist agent, with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals it had before the damage took place.