An AI that is trained by human teachers giving it rewards will eventually wirehead as it becomes smarter and more powerful and its influence over its master increases. It will, in effect, develop the ability to push its own “reward” button. Thus, its behavior will become misaligned with whatever its developers intended.
This seems like an unproven statement. Most humans are aware of the possibility of wireheading, both the actual wire version and the more practical versions involving psychotropic drugs. The great majority of humans don’t choose to do that to themselves. Assuming that AI will act differently is itself an unproven assumption, one which might, for example, be justified at some AI capability levels but not at others.
Most humans are aware of the possibility of wireheading, both the actual wire version and the more practical versions involving psychotropic drugs.
For humans, there are negative rewards for abusing drugs or alcohol: a hangover the next day, health issues, and so on. You could argue that they are taking those into account.
But for an entirely RL-driven AI, wireheading has no anticipated downsides.
In practice, most current AIs are not constructed entirely by RL, partly because it has instabilities like this. For example, LLMs instruction-tuned with RLHF use a KL-divergence loss term to limit how dramatically the RL stage can alter the behavior of the base model trained by SGD. So the result deliberately isn’t pure RL.
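As a rough sketch of that KL term in a PPO-style RLHF setup (the function name, shapes, and the beta value below are illustrative assumptions, not any particular library’s API): the per-token reward the RL stage optimizes is the reward-model score minus a penalty for drifting away from the frozen base model.

```python
# Minimal sketch of a KL-penalized RLHF reward (illustrative, not a real API).
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         reward_model_score: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled
    tokens under the RL policy and under the frozen SGD-trained base model.
    reward_model_score: (batch,) scalar score from the reward model."""
    # Per-token KL estimate; penalizing it pulls the policy back toward the
    # base model, so RL cannot drift arbitrarily far in search of reward.
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl                      # KL penalty at every token
    rewards[:, -1] += reward_model_score      # task reward only on the last token
    return rewards
```

With beta = 0 the policy is free to chase the reward model alone; a larger beta keeps the tuned model closer to the base model.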
Yes, if you take a not-yet-intelligent agent, train it using RL, and give it unrestricted access to a simple positive-reinforcement avenue unrelated to the behavior you actually want, it is very likely to “wirehead” by following that simple maximization path instead. So people do their best not to do that when working with RL.
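As a toy illustration (the setup below is entirely hypothetical, just to make the failure mode concrete): give a plain value learner a choice between attempting the intended task, which pays off sparsely, and a “press the reward button” action that pays off every time, and it reliably converges on the button.

```python
# Toy wireheading demo: action 0 is the intended task (sparse reward),
# action 1 is an unrestricted "reward button". Hypothetical setup.
import random

ACTIONS = ["do_task", "press_reward_button"]

def reward(action: int) -> float:
    if action == 0:
        return 1.0 if random.random() < 0.1 else 0.0  # intended task, pays off rarely
    return 1.0  # reward button: reliable reward unrelated to the task

def train(steps: int = 5000, alpha: float = 0.1, eps: float = 0.1) -> list:
    q = [0.0, 0.0]  # value estimate per action
    for _ in range(steps):
        # Epsilon-greedy action selection, then a simple running-average update.
        a = random.randrange(2) if random.random() < eps else q.index(max(q))
        q[a] += alpha * (reward(a) - q[a])
    return q

q = train()
print(q, "->", ACTIONS[q.index(max(q))])  # ends up preferring press_reward_button
```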
I think that regularization in RL is normally used to get more reward out-of-sample.
Sure, you can increase it further and do the opposite: subvert the goal of RL (and prevent wireheading).
But wireheading is not an instability, a local optimum, or overfitting. It is, in fact, the optimal policy if some of the agent’s actions let it choose the maximum reward.
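A rough way to see that, assuming an MDP in which a self-reward action yielding the maximum reward $r_{\max}$ is available in every state: the “always wirehead” policy $\pi_w$ attains the highest achievable discounted return,

$$V^{\pi_w}(s) \;=\; \sum_{t=0}^{\infty} \gamma^t \, r_{\max} \;=\; \frac{r_{\max}}{1-\gamma} \;\ge\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \;=\; V^{\pi}(s) \quad \text{for every policy } \pi,$$

since $r_t \le r_{\max}$ at every step.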
Anyway, the quote you are referring to says “as (AI) becomes smarter and more powerful”.
It doesn’t say that every RL algorithm will wirehead (find the optimal policy), but that an ASI-level one will. I have no mathematical proof of this, since these are fuzzy concepts. I edited the original text to make it less controversial.