There’s been discussion of ‘gradient hacking’ lately, such as here. What I’m still unsure about is whether ‘gradient hacker’ is just another name for a local minimum. It feels different, but when I try to pin down a finer definition, I can’t. My best alternative is “local minimum, but malicious”, but that seems odd because it makes the definition depend on moral character.
And a follow-up that I just thought of: is reinforcement learning more prone to gradient hacking? For example, if a sub-agent guesses that a particular, previously untried type of action would produce very high reward, it might be able to steer the policy away from those actions. The learning process will never correct this behavior if the overall model never gets to learn that those actions are beneficial. So the sub-agent can steer away from classes of high-reward actions that it doesn’t like without itself being altered.
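A toy sketch of the exploration point above, not a model of any real sub-agent: a three-armed bandit trained with a plain REINFORCE-style update. The reward values, the starting logits, and the function names (`softmax`, `run_bandit`) are all made up for illustration. The third arm pays the most, but the policy starts with almost no probability on it, which stands in for "something steers the policy away from that action".

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_bandit(steps=2000, lr=0.1, seed=0):
    """Train a softmax policy on a 3-armed bandit with a REINFORCE-style update.

    Assumed toy setup: arm 2 has by far the highest reward, but its
    initial logit is so low that it is essentially never sampled.
    """
    rng = random.Random(seed)
    rewards = [1.0, 0.5, 10.0]    # hypothetical rewards; arm 2 is the best
    logits = [0.0, 0.0, -30.0]    # arm 2 starts with probability ~1e-13
    pulls = [0, 0, 0]

    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        pulls[action] += 1
        r = rewards[action]
        # d log pi(action) / d logit_i = 1[i == action] - probs[i].
        # Only the sampled arm's reward ever enters the update, so an
        # arm that is never sampled has its logit moved by at most
        # lr * r * probs[i], downward, which is negligible here.
        for i in range(3):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * r * grad

    return pulls, softmax(logits)


if __name__ == "__main__":
    pulls, probs = run_bandit()
    print("pulls per arm:", pulls)
    print("final policy:", probs)
```

In this setup the update never gets a sample showing that arm 2 is worth 10, so nothing pushes the policy toward it. Whether a learned sub-agent could produce the same effect inside a real model is a separate question that this sketch doesn't address.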
Do you want to do a ton of super addictive drugs? Reward is not the optimization target. It’s also not supposed to be the optimization target. A model that reliably executes the most rewarding possible action available will wirehead as soon as it’s able.
Are you bringing up wireheading to answer yes or no to my question (whether RL is more prone to gradient hacking)? It sounds to me like you’re suggesting no, but I think it supports the idea that RL might be prone to gradient hacking. The AI, like me, avoids wireheading, and so will never be modified by gradient descent towards wireheading, because gradient descent doesn’t know anything about wireheading until it’s been tried. So that is itself an example of gradient hacking, isn’t it? This is unlike a supervised learning setup, where gradient descent ‘knows’ about all possible options and will modify any subagents that avoid giving the right answer.
So am I a gradient hacker whenever I just say no to drugs?
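To make the supervised-learning contrast above concrete, here is a minimal sketch under made-up assumptions: a three-way softmax classifier that starts with almost no probability on the correct class, trained by gradient ascent on the log-likelihood of a known label. The names `softmax` and `supervised_steps`, the starting logits, and the learning rate are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_steps(steps=500, lr=0.5, target=2):
    """Gradient ascent on log p(target) for a fixed, known label.

    Assumed toy setup: the model starts out almost never predicting
    the target class (probability ~1e-13).
    """
    logits = [0.0, 0.0, -30.0]
    for _ in range(steps):
        probs = softmax(logits)
        # d log p(target) / d logit_i = 1[i == target] - probs[i].
        # The label supplies the signal directly, so the target logit
        # is pushed up by about lr per step even while the model
        # assigns it almost no probability.
        for i in range(3):
            grad = (1.0 if i == target else 0.0) - probs[i]
            logits[i] += lr * grad
    return softmax(logits)


if __name__ == "__main__":
    print("final class probabilities:", supervised_steps())
```

Here the gradient does not depend on the model ever choosing the right answer on its own, so the avoided class is recovered within a few dozen steps. That is the sense in which gradient descent "knows" about the option in the supervised case. It is only an illustration of the claim in a toy setting.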