I think you’re still baking in some implicit assumptions about the RL algorithm being weak. The RL algorithm is not obliged to take the current policy, observe its actions, and perturb it a little. At the most extreme end, the most general possible RL algorithm is a giant NN with a powerful world model that very carefully picks policies to run experiments with, so as to gain the maximum number of bits of information for updating its world model. For a less extreme example, consider if the human reward circuitry did a little bit of model-based planning and tried to subtly steer you toward addictive things that your neocortex had not yet realized were addictive. It could also simply be that the inductive biases favor wireheading: if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
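To make the extreme end concrete, here is a toy sketch of that kind of information-maximizing outer loop. Everything in it (the Gaussian observation model, the candidate parameters and policies, the constants) is invented for illustration, not taken from any particular system:

```python
import numpy as np

# Toy sketch: the learner keeps a discrete posterior over candidate world-model
# parameters and, rather than perturbing its current policy, deliberately runs
# whichever candidate policy it expects to yield the most bits of information.

rng = np.random.default_rng(0)
thetas = np.linspace(-1.0, 1.0, 9)             # candidate world-model parameters
posterior = np.ones_like(thetas) / len(thetas)
policies = np.linspace(-1.0, 1.0, 5)           # candidate policies (one action each)
noise = 0.3

def entropy_bits(p):
    return -np.sum(p * np.log2(p + 1e-12))

def likelihood(obs, action, theta):
    # Observation model: obs ~ N(theta * action, noise^2), up to a constant factor.
    return np.exp(-0.5 * ((obs - theta * action) / noise) ** 2)

def expected_info_gain(action, posterior, n_samples=200):
    # Monte Carlo estimate of the expected drop in posterior entropy (in bits)
    # if this policy were run as an experiment.
    gains = []
    for _ in range(n_samples):
        theta = rng.choice(thetas, p=posterior)
        obs = theta * action + rng.normal(0.0, noise)
        new_post = posterior * likelihood(obs, action, thetas)
        new_post /= new_post.sum()
        gains.append(entropy_bits(posterior) - entropy_bits(new_post))
    return np.mean(gains)

true_theta = 0.7
for step in range(10):
    # Pick the most informative experiment, not the reward-greediest policy.
    action = max(policies, key=lambda a: expected_info_gain(a, posterior))
    obs = true_theta * action + rng.normal(0.0, noise)
    posterior *= likelihood(obs, action, thetas)
    posterior /= posterior.sum()

print("posterior mean of theta:", np.dot(posterior, thetas))
```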
I’m also not sure the human example fully serves the point you’re trying to make, because the very dumb and simple human RL algorithm does in fact sometimes win out against the human.
Not sure where the disagreement is here, if there is one. I agree that there are lots of sophisticated RL setups possible, including ones that bias the agent towards caring about certain kinds of things (like addictive sweets or wireheading). I also agree that even a basic RL algorithm like TD learning may be enough to steer the cognition of complex, highly capable agents in some cases. What is it that you are saying depends on the RL algorithm being weak?
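For concreteness, here is the kind of “basic” algorithm I have in mind, as a toy tabular TD(0) sketch (the chain environment and all the constants are made up for illustration):

```python
import numpy as np

# Minimal tabular TD(0) on a five-state chain: the whole "algorithm" is the
# one-line nudge of V(s) toward r + gamma * V(s').
n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)                          # value estimates; terminal state stays 0
rng = np.random.default_rng(0)

for episode in range(2000):
    s = 0
    while s < n_states - 1:
        s_next = s + rng.integers(0, 2)         # crude fixed policy: stay or step right
        r = 1.0 if s_next == n_states - 1 else 0.0
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # the TD(0) update
        s = s_next

print(np.round(V, 2))
```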
if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
I am generally very skeptical about simplicity arguments. I think that sort of reasoning generally does not work for making specific predictions about how neural networks will behave, particularly in the context of RL.