Empirically, we already have systems that reward hack. You train a system to grab a ball with its claw; it learns to place the claw between the camera and the ball so that it appears to be grasping the ball.
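A toy sketch of how such a proxy reward can be satisfied without the intended behaviour. The reward function below is my own illustrative stand-in, not the signal used in the original grasping experiment: it scores how much of the ball’s image region is covered by the claw’s image region, which is meant to correlate with “claw wrapped around ball”.

```python
import numpy as np

def proxy_reward(claw_mask: np.ndarray, ball_region: np.ndarray) -> float:
    """Toy pixels-only grasping proxy: the fraction of the ball's image
    region that the claw's projection covers in the camera frame."""
    covered = np.logical_and(claw_mask, ball_region).sum()
    return float(covered) / max(int(ball_region.sum()), 1)

# Intended solution: the claw encloses the ball, the projections overlap,
# and the proxy reads ~1.0.
# Hack: the claw hovers between the camera and the ball; its projection
# also covers the ball's image region, so the proxy reads ~1.0 with no
# grasp ever happening.
```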
Narrow comment:
I think “reward hacking” (which I think would be better named “reinforcement attractors”) is dominated by questions of exploration. If systems explore into reinforcement attractors, they will stay there. If it’s easier to explore into a “false-alarm” reinforcement event (e.g. claw in front of ball), then that will probably happen. Same for the classic boat-racing example.
I’m definitely on board with reinforcement attractors being a real problem which has actually been observed.
But I think there’s a very important caveat here, which is that these reinforcement attractors have to be explored into (unless the system is already cognizant of its reinforcement function and is “trying” to make the reinforcement number go up in general). I worry that this exploration component is not recognized, or is acknowledged only as an afterthought. But AFAICT exploration is actually the main driver of whether reinforcement attractors are realized.
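A minimal sketch of that claim (my own toy construction, not something from the post): a tabular Q-learner on a short chain where a “false-alarm” terminal state pays more than the real task. Whether a run ends up at the attractor is almost entirely a function of the exploration rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP: state 0 = "genuinely do the task" (reward 1.0),
# state N = "false-alarm reinforcement event" (reward 2.0, i.e. the
# attractor pays more than the real task). Episodes start in the middle
# and terminate at either extreme.
N, START = 6, 3
ACTIONS = (-1, +1)  # step left / step right

def run(epsilon, episodes=2000, alpha=0.5, gamma=0.95):
    q = np.zeros((N + 1, 2))
    ended_at_attractor = 0
    for _ in range(episodes):
        s = START
        while True:
            # epsilon-greedy action selection: exploration is the only way
            # the false-alarm state can ever be discovered.
            a = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q[s]))
            s2 = s + ACTIONS[a]
            done = s2 in (0, N)
            r = 1.0 if s2 == 0 else (2.0 if s2 == N else 0.0)
            target = r if done else r + gamma * q[s2].max()
            q[s, a] += alpha * (target - q[s, a])
            s = s2
            if done:
                ended_at_attractor += int(s == N)
                break
    return ended_at_attractor / episodes

for eps in (0.0, 0.05, 0.3):
    print(f"epsilon={eps}: fraction of episodes ending at the attractor = {run(eps):.2f}")
```

In this toy, with little or no exploration the false alarm is typically never found at all and the agent converges on the real task; with enough exploration it eventually stumbles into the higher-paying state, and from then on the greedy updates keep pulling it back there.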
A couple of questions wrt this:
Could LLMs develop the type of self-awareness you describe as part of their own training or RL-based fine-tuning? Many LLMs do seem to have “awareness” of their existence and function (incidentally, this could be evidenced by the model evals run by Anthropic). I assume a simple future setup could be an auto-GPT-N with a prompt like “You are the CEO of Walmart; you want to make the company maximally profitable.” In that scenario I would contend that the agent could easily be aware of both its role and function, and could easily be attracted to that search space.
Could we detect a deployed (and continually learning) agent entering these attractors? Personally, I would say that the more complex the plan being carried out, the harder it is for us to determine whether it is actually heading there (so we need supervision).
This seems to me very close to the core of Krueger et al.’s work in “Defining and Characterizing Reward Gaming”, and the solution of “stop before you encounter the attractors/hackable policy” seems hard to actually implement without some form of advanced supervision (which might get deceived), unless we find some broken scaling laws for this behavior.
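On the detection question and the “stop before you encounter the attractors” idea, here is a minimal sketch of the kind of supervision hook I have in mind (my own construction, not the mechanism from the paper): halt training when the proxy reward the agent is optimizing keeps climbing while an independently audited score does not. `train_one_episode` and `audit` in the usage comment are hypothetical placeholders.

```python
from statistics import mean

def should_halt(proxy_scores, audited_scores, window=50, max_gap=0.5):
    """Halt if the recent proxy reward improves much faster than an
    independently audited score -- a crude signature of sliding into an
    attractor that the proxy rewards but the audit does not."""
    if len(proxy_scores) < 2 * window or len(audited_scores) < 2 * window:
        return False
    proxy_gain = mean(proxy_scores[-window:]) - mean(proxy_scores[-2 * window:-window])
    audit_gain = mean(audited_scores[-window:]) - mean(audited_scores[-2 * window:-window])
    return proxy_gain - audit_gain > max_gap

# Schematic usage in a training loop (train_one_episode / audit are
# hypothetical placeholders for the learner and a trusted, slower check):
#
#     proxy_scores.append(train_one_episode(agent))
#     audited_scores.append(audit(agent))
#     if should_halt(proxy_scores, audited_scores):
#         break  # stop before the policy settles into the attractor
```

Of course this inherits exactly the caveat above: it only helps as long as the audited signal itself isn’t being deceived.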
I don’t count on myopic agents (which might be limited in their exploration) being where the economic incentive lies.
Assuming it’s LLMs all the way to AGI, would schemes like Constitutional AI/RLHF, applied during pre-training as well, be enough to constrain the model’s search space?
EDIT: aren’t we risking that all the tropes about evil AI act as an attractor for LLMs?