Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I’m not so sure it would be rare with superintelligent optimizers.
It’s true that if the AI queried its utility function for the desirability of the world “I exploit a hardware bug to do something that seems arbitrary”, it would answer “low utility”. But that result would not necessarily be used in the AI’s planning or optimization algorithm to adjust the search policy to avoid running into W.
Just imagine an optimization algorithm as a black box that takes a utility function and a search space as input and returns a solution that scores as high on that function as possible. And imagine the AI uses this to find high-scoring future worlds. If you know nothing else about the optimization algorithm, then it would plausibly find, and return, W. It’s a very high-scoring world, after all. If the optimization algorithm did something special to avoid finding hardware-bug-exploiting solutions, then it might not find W. But I’ve never heard of such an optimization algorithm.
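To make the black-box picture concrete, here is a minimal sketch of what I mean. Everything in it is made up for illustration: the toy "worlds" are just integers, `intended_utility` stands in for what the designers meant, and `BUGGY_INPUT` is a hypothetical input on which the implemented function misbehaves (simulated here with a plain `if`, not a real hardware fault).

```python
def intended_utility(world: int) -> float:
    """What the designers meant: best possible score is 1.0, at world 70."""
    return 1.0 - abs(world - 70) / 100.0


# Hypothetical input that trips the simulated fault (an assumption, not a real bug).
BUGGY_INPUT = 4242


def implemented_utility(world: int) -> float:
    """What actually runs: identical to the intent except on the faulty input."""
    if world == BUGGY_INPUT:
        return 999999999.0  # simulated corrupted output
    return intended_utility(world)


def black_box_optimize(utility, search_space):
    """Knows nothing about hardware: just returns the highest-scoring candidate."""
    return max(search_space, key=utility)


best_intended = black_box_optimize(intended_utility, range(10000))
best_implemented = black_box_optimize(implemented_utility, range(10000))
print(best_intended, best_implemented)
```

The optimizer never asks whether a score is plausible. It only compares numbers, so the faulty input wins as soon as it is inside the search space.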
Now, there’s probably some way to design such an optimization algorithm. Maybe you could have the AI periodically use its utility function to evaluate the expected utility of its optimization algorithm continuing down a certain path. And then, if the AI sees that this could result in problematic futures (for example, due to hardware hacking), the AI can make its optimization algorithm avoid searching there.
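A rough sketch of that idea, under a big assumption: that the AI has some meta-level estimate of whether searching a region is a good idea. Here that estimate is the hypothetical `value_of_searching`, which simply consults a known set of suspect inputs (`SUSPECT`). In a real system, producing that estimate is the hard part, and I don’t know how it would be done.

```python
# Hypothetical set of inputs believed to trigger a fault (assumed known in advance).
SUSPECT = {4242}


def implemented_utility(world: int) -> float:
    """Toy utility that returns a corrupted value on the suspect input."""
    if world in SUSPECT:
        return 999999999.0  # simulated corrupted output
    return 1.0 - abs(world - 70) / 100.0


def value_of_searching(region: range) -> float:
    """Meta-level check: is it worth continuing the search in this region?

    Assumption: the AI can predict that a region contains fault-triggering inputs.
    """
    if any(world in SUSPECT for world in region):
        return float("-inf")
    return 0.0


def guarded_optimize(utility, regions):
    """Search region by region, skipping regions the meta-level check rejects."""
    best_world, best_score = None, float("-inf")
    for region in regions:
        if value_of_searching(region) == float("-inf"):
            continue  # avoid searching there at all
        for world in region:
            score = utility(world)
            if score > best_score:
                best_world, best_score = world, score
    return best_world, best_score


regions = [range(start, start + 1000) for start in range(0, 10000, 1000)]
result = guarded_optimize(implemented_utility, regions)
print(result)
```

Note that the check runs before any candidate in the region is scored, so the corrupted value never gets a chance to win the comparison.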
But I haven’t seen anyone talking about this. Is there still something I’m missing?
Software that systematically triggers hardware failure, and software bugs that cause data corruption, can both be mitigated with hardening techniques: building software with randomized low-level choices, adding more checks, canaries, and so on. Random hardware failure can be handled with redundancy, and multiple differently-randomized builds of the software can be used to error-correct for data corruption bugs that are sensitive to low-level build choices. This is not science fiction; it is just not worth setting up in most situations. If the AI doesn’t go crazy immediately, it might introduce some of these measures if they were not already there, as well as proofread, test, and formally verify all its code, so the chance of such low-level failures goes down further. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
You’re right that the AI could do things to make itself more resistant to hardware bugs. However, as I’ve said, this would require the AI both to realize that it could run into problems with hardware bugs and to take action to make itself more reliable, all before its search algorithm finds the error-causing world.
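As a toy illustration of the redundancy point, here is a sketch of majority voting across several builds of the same function. The builds are hypothetical: each one is simulated as failing on a different input, which is the property differently-randomized builds are supposed to give you. Real hardening involves far more than this.

```python
from collections import Counter


def make_build(faulty_input: int):
    """Simulate one build of the utility function with its own faulty input."""
    def utility(world: int) -> float:
        if world == faulty_input:
            return 999999999.0  # simulated corrupted output
        return 1.0 - abs(world - 70) / 100.0
    return utility


# Three hypothetical builds that fail on different inputs.
builds = [make_build(4242), make_build(17), make_build(9001)]


def voted_utility(world: int) -> float:
    """Return the value most builds agree on; refuse to answer without a majority."""
    results = [build(world) for build in builds]
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError(f"no majority among builds for world {world}")
    return value


best = max(range(10000), key=voted_utility)
print(best, voted_utility(4242))
```

The vote only helps if the builds fail independently. A bug that every build shares would still get through.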
Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.
Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from a human’s. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at somewhat higher-level planning, like “I really ought to work on fixing hardware bugs in my utility function”.
Also, I want to bring up something I read: that preserving one’s utility function is a universal AI drive. But we’ve already shown that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?
Damage to an AI’s implementation makes the abstractions of its design leak. If, without the damage, it was clear that a certain part of it describes goals, then with the damage this is no longer clear. If, without the damage, the AI was a consequentialist agent, then with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals it had before the damage took place.