In practice, most current AIs are not constructed entirely by RL, partly because it has instabilities like this. For example, LLMs instruction-tuned via RLHF use a KL-divergence loss term to limit how dramatically the RL stage can alter the base model's behavior, which was trained by SGD. So the result deliberately isn't pure RL.
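For concreteness, a minimal sketch of that penalty term (the numbers and names like `logp_rl`, `logp_ref`, `beta` are hypothetical; real RLHF pipelines apply this per token inside a PPO-style update, so treat it as an illustration, not anyone's actual implementation):

```python
import numpy as np

# Hypothetical per-token log-probs for one sampled response
logp_rl = np.array([-1.2, -0.8, -2.1])    # current RL policy
logp_ref = np.array([-1.0, -0.9, -1.5])   # frozen SGD-trained base model
reward_from_rm = 3.7                      # scalar score from the reward model
beta = 0.1                                # KL penalty coefficient

# Sample-based KL estimate for this response: sum of log pi_RL - log pi_ref
kl_penalty = np.sum(logp_rl - logp_ref)

# The reward the RL step actually optimizes: a larger beta keeps the policy
# closer to the base model, at the cost of raw reward-model score.
shaped_reward = reward_from_rm - beta * kl_penalty
print(shaped_reward)
```

The point is that the objective being maximized is already a blend of "please the reward model" and "stay close to the SGD-trained model", which is why the result isn't pure RL.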
Yes, if you take a not-yet intelligent agent, train it using RL, and give it unrestricted access to a simple positive reinforcement avenue unrelated to the behavior you actually want, it is very likely to “wire-head” by following that simple maximization path instead. So people do their best not to do that when working with RL.
I think that regularization in RL is normally used to get more reward out-of-sample, i.e. to generalize better.
Sure, you can crank the regularization up further and do the opposite – subvert the goal of RL (and prevent wireheading).
But wireheading is not an instability, a local optimum, or overfitting. It is in fact the optimal policy, if some of your actions let you set the reward to its maximum.
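A toy illustration of that claim (a made-up two-action bandit, not anyone's real setup): if one action directly yields the maximum reward, a standard value-based learner converges to it, because under the stated objective it genuinely is optimal rather than a failure mode of the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit: action 0 = "do the intended task" (noisy, modest reward),
#             action 1 = "tamper with the reward signal" (maximum reward).
def step(action):
    return rng.normal(0.5, 0.1) if action == 0 else 1.0

q = np.zeros(2)            # action-value estimates
alpha, epsilon = 0.1, 0.1  # learning rate, exploration rate
for _ in range(5000):
    a = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q))
    q[a] += alpha * (step(a) - q[a])

print(q)                  # Q[1] ends up highest
print(int(np.argmax(q)))  # greedy policy picks the tampering action: 1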
Anyway, the quote you are referring to says “as (AI) becomes smarter and more powerful”.
It doesn’t say that every RL algorithm will wirehead (find the optimal policy), but that an ASI-level one will. I have no mathematical proof of this, since these are fuzzy concepts. I edited the original text to make it less controversial.