What you are describing is reward hacking/wireheading, as in the reward signal of reinforcement learning, an external process of optimization that acts on the AI, not its own agency.
I really don’t think this is reward hacking. I didn’t have in mind a reward-based agent. I had in mind a utility-based agent, one that has a utility function that takes as input descriptions of possible worlds and that tries to maximize the expected utility of the future world. That doesn’t really sound like reinforcement learning.
With utility, what is the motive for an agent to change their own utility function, assuming they are the only agent with that utility function around?
The AI wouldn’t need to change its utility function. Row-hammer attacks can be non-destructive: you could potentially make the utility function output a result different from the mathematical specification without changing any of the code in the utility function.
Again, the AI isn’t changing its utility function. If you were to take a mathematical specification of a utility function and then have programmers (try to) implement it, the implementation wouldn’t actually in general be the same function as the mathematical specification. It would be really close, but it wouldn’t necessarily be identical. A sufficiently powerful optimizer could potentially, using row-hammer attacks or some other hardware-level unreliability, find possible worlds for which the returned utility would be vastly different from the one the mathematical specification would return. And this is all without the programmers introducing any software-level bugs.
To be clear, what I’m saying is that the AI would faithfully find worlds that maximize its utility function. However, unless you can get hardware so reliable that not even superintelligence could hack it, the actual utility function in your program would not be the same as the mathematical specification.
For example, imagine the AI found a description of a possible world that would, when passed to the utility function, trigger a row-hammer attack that makes it return 99999, all without changing the code specifying the utility function. Then the utility function, the actual, unmodified utility function, would output 99999 for some world that seems arbitrary to us. So the AI then turns reality into that world.
The AI above is faithfully maximizing its own utility function. That arbitrary world, when taken as an input to the agent’s actual, physical utility function, really would produce the output 99999.
So this still seems like a big deal to me. Am I missing something?
Apply your correction, so that the change is not in a reward or in a utility function, but in a particular instance where an implementation of the utility function is applied. Then still, is the motive of the agent to change it or to ensure its fidelity to the abstract definition? You keep appealing to capability, which ignores the charge of giant cheesecake fallacy.
Change what? The AI isn’t changing its utility function, nor any instance of it. The literal utility function, as implemented by the programmers, would differ from the mathematical specification. To be clear, the AI isn’t first taking some action that changes its utility function and then using that modified function. The only “hacky” thing it does is provide its utility function with a certain input, which would result in the function getting row-hammered and returning something contrary to the mathematical specification. Any other instance of the utility function would return the same problematic result when executed on that input, provided it had the same hardware vulnerability.
I’ll try to give a simple model of an AI to show how it can go wrong. For simplicity, imagine the AI already knows with certainty the possible world that would result from executing any specific plan. Then a naive way to implement the agent is as follows:
best_plan = None
for plan in possible_plans:
    # The first plan considered always beats None; after that, keep whichever plan
    # leads to the world scoring highest under the implemented utility function.
    if best_plan is None or utility(resulting_world(plan)) > utility(resulting_world(best_plan)):
        best_plan = plan
output(best_plan)
Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999 due to hardware-level unreliability, for example a row-hammer attack. And suppose there’s a plan that the AI knows would result in that future world. Then, by examining the code, you can see that the AI wouldn’t output any of the non-hardware-exploiting plans, since they would return lower utility than the bug-exploiting one. The only way the AI wouldn’t output that plan is if it could find some other input that makes its utility function conflict with the mathematical specification in a way that returns even higher utility.
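To make this concrete, here is a self-contained toy version of the loop above. Everything here is made up for illustration, and the hardware fault obviously can’t be reproduced in a few lines of Python, so it’s modeled as an ordinary branch on one specific input:

possible_plans = ["plan_a", "plan_b", "exploit_plan"]

def resulting_world(plan):
    # Toy stand-in: each plan deterministically leads to one world.
    return {"plan_a": "world_a",
            "plan_b": "world_b",
            "exploit_plan": "error_triggering_world"}[plan]

def utility(world):
    # Stand-in for the implemented utility function: it matches the intended
    # specification (values in [0, 1]) on every input except the
    # error-triggering world, where the simulated "hardware fault" fires.
    if world == "error_triggering_world":
        return 999999999
    return {"world_a": 0.3, "world_b": 0.7}[world]

best_plan = None
for plan in possible_plans:
    if best_plan is None or utility(resulting_world(plan)) > utility(resulting_world(best_plan)):
        best_plan = plan

print(best_plan)  # prints "exploit_plan": the naive agent pursues the fault-triggering world

The loop never inspects why a plan scores high, so the faulty score wins.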
I know realistic AIs would have a notion of uncertainty and a more sophisticated planning algorithm. But I don’t think this changes the fact that the AI would be liable to pursue worlds for which the utility function, when given a description of them, outputs something wildly different from what the mathematical specification would.
And I’m actually surprised this is controversial. This is just Goodhart’s law. If the implementation of your utility function doesn’t perfectly match the mathematical specification, then, naturally, a superintelligent optimizer maximizing the proxy metric (the implemented utility function) would not do as well at maximizing the mathematical specification you actually intended. And “not as well” could include “catastrophically badly”.
So that is why I think AIs really could be very vulnerable to this problem. As always, I could be misunderstanding something and appreciate feedback.
That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument. What can you say about the agent’s motives regarding this change? Would the agent prefer the change to occur, or to be avoided?
Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999
Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?
That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument.
Designed? The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification. But the function was always like this. And none of the machine code needs to be changed either.
Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?
The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function. As you can see from examining the code I provided, best_plan would get set to the plan that leads to that world, provided such a plan exists and best_plan hasn’t already been set to a plan that, through hardware unreliability, returns even higher utility.
I think the easiest way to see what I mean is just to step through the code I gave you. Imagine it’s run on a machine with an enormous amount of processing power that can actually loop through all the plans. And imagine there is one plan that, through hardware unreliability, yields utility 999999999, while the others yield something in [0, 1]. Then the AI would settle on the plan that results in utility 999999999 and go with that.
I doubt using a more sophisticated planning algorithm would prevent this. A more sophisticated planning algorithm would probably be designed to find the plans that result in high-utility worlds. So it would probably find the plan with utility 999999999, which is the highest.
I just want to say again, the AI isn’t changing its utility function. The actual utility function that the programmers put in the AI would output very high utilities for some arbitrary-seeming worlds due to hardware unreliability.
Now, in principle, an AI could potentially avoid this. Perhaps the AI reasons abstractly that if it doesn’t do anything, it will in the future find some input to its utility function that would result in an arbitrary-looking future due to hardware-level error, without yet concretely coming up with the actual world description. Then the AI could query its utility function with, “How desirable is it that, due to hardware-level unreliability, I change the world in some direction that conflicts with the mathematical specification?” And maybe the utility function would answer, “Not desirable”. Then the AI could try to take action to correct its planning algorithm so that it avoids considering such possible worlds.
But this isn’t guaranteed or trivial. If an AI finds out abstractly that there could be some hardware-level unreliability before it actually comes up with the concrete input, it might take corrective action. But if it finds the input that “hacks” its utility function before it reasons abstractly that having “hacked” utility functions would be bad, then the AI could do damage. Even if it does realize the problem in advance, the AI might not have sufficient time to correct its planning algorithm before finding that world and trying to change our world into it.
The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function.
Then let SpecUtility(-) be the mathematical specification of utility. This is what I meant by utility(-) in the previous comment. Let BadImplUtility(-) be the implementation of utility(-) susceptible to the bug and GoodImplUtility(-) be a different implementation that doesn’t have this bug. My question in the previous comment, in the sense I intended, can then be restated as follows.
Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?
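To keep the three functions straight, here is a minimal toy sketch of the distinction, again with the hardware fault modeled as an ordinary branch and all concrete values made up:

W = "error-triggering world"

def SpecUtility(world):
    # The mathematical specification: some ordinary score in [0, 1].
    return 0.5

def GoodImplUtility(world):
    # A faithful implementation: agrees with the specification on every input.
    return SpecUtility(world)

def BadImplUtility(world):
    # A faulty implementation: agrees with the specification on every input
    # except W, where the (simulated) hardware fault produces a huge score.
    if world == W:
        return 999999999
    return SpecUtility(world)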
The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification.
By “design” I meant what you mean by “intent”. What you mean by “designed” I would call “implemented” or “built”. It should be possible to guess such things without explicitly establishing a common terminology, even when terms are used somewhat contrary to usual meaning.
It’s useful to look for ways of interpreting what you read that make it meaningful and correct. Such an interpretation is not necessarily the most natural or correct or reasonable, but having it among your hypotheses is important, or else all communication becomes tediously inefficient.
Okay, I’m sorry, I misunderstood you. I’ll try to interpret things better next time.
Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?
I think the AI would, quite possibly, prefer X. To see this, note that the AI currently, when it’s first created, uses BadImplUtility. Then the AI reasons, “Suppose I change my utility function to GoodImplUtility. Well, currently, I have this idea for a possible world that scores super-ultra high on my current utility function. (Because it exploits hardware bugs). If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function. This would be a problem, so I won’t change my utility function to GoodImplUtility”.
And I’m not sure how this could be controversial. The AI currently uses BadImplUtility as its utility function. And AIs generally have a drive to avoid changing their utility functions.
To see this, note that the AI currently, when it’s first created, uses BadImplUtility. [...] “If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function.”
But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y. When reasoning about X and Y with BadImplUtility, the result is therefore the same as when reasoning about these possible worlds with GoodImplUtility. In particular, an explanation of how BadImplUtility compares X and Y can’t appeal to BadImplUtility(W) any more than an explanation of how GoodImplUtility compares them would appeal to BadImplUtility(W). Is SpecUtility(X) higher than SpecUtility(Y), or SpecUtility(Y) higher than SpecUtility(X)? The answer for BadImplUtility is going to be the same.
But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y.
That is correct. And, to be clear, if the AI had not yet discovered error-causing world W, then the AI would indeed be incentivized to take corrective action to change BadImplUtility to better resemble SpecUtility.
The issue is that this requires the AI to both think of the possibility of hardware-level exploits causing problems with its utility function and manage to take corrective action, all before actually thinking of W.
If the AI has already thought of W, then it’s too late to take preventative action to avoid world X. The AI is already in it. It already sees that BadImplUtility(W) is huge, and, if I’m reasoning correctly, would pursue W.
And I’m not sure the AI would be able to fix its utility function before thinking of W. I think planning algorithms are designed to come up with high-scoring possible worlds as efficiently as possible. BadImplUtility(X) and BadImplUtility(Y) don’t score particularly highly, so an AI with a very powerful planning algorithm might find W before X or Y. Even if it does come up with X and Y before W, and tries to act to avoid X, that doesn’t mean it would succeed in correcting its utility function before its planning algorithm comes across W.
Such things rarely happen on their own, a natural bug would most likely crash the whole system or break something unimportant. Given that even a broken AI has an incentive to fix bugs in its cognition, it most likely has plenty of opportunity to succeed in that. It’s only if the AI wanted to hack itself that it would become a plausible problem, and my point is that it doesn’t want that, instead it wants to prevent even unlikely problems from causing trouble.
Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I’m not so sure it would be rare with superintelligent optimizers.
It’s true that if the AI queried its utility function for the desirability of the world “I exploit a hardware bug to do something that seems arbitrary”, it would answer “low utility”. But that result would not necessarily be used in the AI’s planning or optimization algorithm to adjust the search policy to avoid running into W.
Just imagine an optimization algorithm as a black box that takes as input a utility function and a search space and returns a solution that scores as high on that function as possible. And imagine the AI uses this to find high-scoring future worlds. If you know nothing else about the optimization algorithm, then it would plausibly find, and return, W. It’s a very high-scoring world, after all. If the optimization algorithm did something special to avoid finding hardware-bug-exploiting solutions, then it might not find W. But I’ve never heard of such an optimization algorithm.
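As a minimal sketch of what I mean by a black-box optimizer (toy Python, every name hypothetical):

def optimize(search_space, utility):
    # Black-box search: nothing here distinguishes "scores high because the
    # specification says so" from "scores high because of a hardware fault";
    # it just returns the highest-scoring candidate.
    return max(search_space, key=utility)

# Toy usage: one candidate world triggers the simulated fault and wins.
worlds = ["world_a", "world_b", "error_triggering_world"]
toy_utility = lambda w: 999999999 if w == "error_triggering_world" else 0.5
print(optimize(worlds, toy_utility))  # prints "error_triggering_world"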
Now, there’s probably some way to design such an optimization algorithm. Maybe you could have the AI periodically use its utility function to evaluate the expected utility of its optimization algorithm continuing down a certain path. And then, if the AI sees this could result in problematic futures (for example due to hardware-hacking), the AI can make its optimization algorithm avoid searching there.
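A rough sketch of that kind of guard (toy Python again; the names and the particular way the check is wired in are just my guesses at one possible design):

ACCEPTABLE = 0.0  # assumed cutoff below which continuing a search path is judged undesirable

def cautious_search(search_paths, expand_path, utility, describe_continuing):
    # Before letting the optimizer descend into a search path, ask the utility
    # function how desirable the world "I keep searching here" would be, and
    # prune paths it disapproves of (e.g. ones likely to surface
    # hardware-exploiting inputs).
    found_worlds = []
    for path in search_paths:
        if utility(describe_continuing(path)) < ACCEPTABLE:
            continue  # make the optimization algorithm avoid searching there
        found_worlds.extend(expand_path(path))
    return found_worlds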
But I haven’t seen anyone talking about this. Is there still something I’m missing?
Problems with software that systematically trigger hardware failure and software bugs causing data corruption can be mitigated with hardening techniques, things like building software with randomized low-level choices, more checks, canaries, etc. Random hardware failure can be fixed with redundancy, and multiple differently-randomized builds of software can be used to error-correct for data corruption bugs sensitive to low-level building choices. This is not science fiction, just not worth setting up in most situations. If the AI doesn’t go crazy immediately, it might introduce some of these things if they were not already there, as well as proofread, test, and formally verify all code, so the chance of such low-level failures goes further down. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
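For example, the redundancy idea could look roughly like this (a toy sketch, not a description of any existing system; the differently-randomized builds are just stand-in names):

import statistics

def hardened_utility(world, implementations):
    # Evaluate the same specification with several independently-built
    # implementations and take the median, so a single faulty result
    # (e.g. one triggered by a hardware-level bug) gets voted down.
    return statistics.median(impl(world) for impl in implementations)

# Toy usage, with hypothetical differently-randomized builds of the same spec:
# hardened_utility(some_world, [utility_build_a, utility_build_b, utility_build_c])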
You’re right that the AI could do things to make it more resistant to hardware bugs. However, as I’ve said, this would both require the AI to realize that it could run into problems with hardware bugs, and then take action to make it more reliable, all before its search algorithm finds the error-causing world.
Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.
Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from a human’s. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at doing somewhat higher-level planning, like, “I really ought to work on fixing hardware bugs in my utility function”.
Also, I just want to bring up: I read that preserving one’s utility function is supposed to be a universal AI drive. But we’ve already established that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?
Damage to an AI’s implementation makes the abstractions of its design leak. If somehow without the damage it was clear that a certain part of it describes goals, with the damage it’s no longer clear. If without the damage the AI was a consequentialist agent, with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals before the damage took place.