Enjoyed this post. Though it uses different terminology from the shard theory posts, I think it hits on similar intuitions. In particular, it gets at some key identifiability problems that show up in the embedded context.
Thus, for sufficiently powerful RL algorithms (not policies!), we should expect them to tend to choose policies which implement wireheading.
I think that, by construction, policies and policy fragments that exploit causal short-circuits between their actions and the source of reinforcement will be reinforced if they are instantiated. That seems like a relatively generic property of RL algorithm design. But note that this is conditional on “if they are instantiated”. In general, credit assignment can only accurately reinforce computations it actually gets feedback on; it cannot reinforce based on counterfactual computations, because you don’t know what feedback you would otherwise have gotten. For example, when you compute gradients on a program that contains branching control flow, the gradients reflect how the computations on the branch actually taken contributed to the output, not how computations on the other branches would have contributed.
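To make the branching point concrete, here is a minimal sketch (my own, in PyTorch; not from the post) of how the computed gradient only reflects the branch that actually ran:

```python
# Minimal sketch: gradients only flow through the branch actually taken.
import torch

x = torch.tensor(2.0, requires_grad=True)

# Branching control flow: only one branch ever executes.
if x > 0:
    y = x ** 2   # branch taken
else:
    y = -x       # branch not taken; it contributes nothing to the gradient

y.backward()
print(x.grad)    # tensor(4.) == d(x^2)/dx at x = 2; the untaken branch is invisible
```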
For this reason, I think selection-level arguments about what we should expect of RL agents are weak: they fail to capture the critical dependency of policy updates on the existing policy. I expect that dependency to hold regardless of how “strong” the RL algorithm is, and especially so when the policy in question is clever in the way a human is clever. Maybe there is some way to escape this and thereby construct an RL method that can route around self-serving policies towards reward-maximizing ones, but I am doubtful.
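For concreteness, here is a toy single-step REINFORCE update on a 2-armed bandit (my own sketch, with made-up numbers), showing how the update depends entirely on what the current policy actually did and on the feedback it actually received:

```python
# Toy sketch: the update is computed from the action the current policy sampled
# and the reward it actually received; it encodes nothing about what the
# other arm would have paid out.
import torch

logits = torch.zeros(2, requires_grad=True)   # current policy's parameters
true_rewards = [0.0, 1.0]                     # hypothetical environment

dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                        # drawn from the *current* policy
reward = true_rewards[action.item()]          # feedback only on the action taken

loss = -dist.log_prob(action) * reward        # single-step REINFORCE objective
loss.backward()
print(action.item(), logits.grad)             # gradient is zero whenever the sampled arm paid 0
```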
In particular, one consequence of this is that we also don’t need to postulate some special, as-yet-unknown algorithm that exists only in humans in order to explain why humans end up caring about things in the world. Whether humans wirehead is determined by the same thing that determines whether RL agents wirehead.
I think you’re still baking in some implicit assumptions about the RL algorithm being weak. The RL algorithm is not obliged to take the current policy, observe its actions, and perturb it a little. At the most extreme end, the most general possible RL algorithm is a giant NN with a powerful world model that very carefully picks which policies to run experiments with, so as to gain the maximum number of bits of information for updating its world model. For a less extreme example, consider if the human reward circuitry did a little bit of model-based planning and tried to subtly steer you towards addictive things that your neocortex had not yet realized were addictive. It could also simply be that the inductive biases favor wireheading: if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
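As a rough sketch of the extreme end (my own toy construction, nothing from the post), such an algorithm might score candidate policies by expected information gain under its world model and run whichever scores highest, rather than perturbing whatever policy it ran last:

```python
# Toy sketch: choose which policy (here, bandit arm) to run next by expected
# information gain about the world model, not by perturbing the current policy.
import numpy as np

# World model: a posterior over two hypotheses; all numbers are made up.
posterior = np.array([0.5, 0.5])       # P(h) for h in {0, 1}
p_reward = np.array([[0.9, 0.5],       # P(reward=1 | h=0, arm)
                     [0.1, 0.5]])      # P(reward=1 | h=1, arm)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_info_gain(arm):
    """Expected reduction in posterior entropy from running policy `arm`."""
    gain = 0.0
    for outcome in (1, 0):
        like = p_reward[:, arm] if outcome == 1 else 1.0 - p_reward[:, arm]
        p_outcome = float(np.sum(posterior * like))
        if p_outcome == 0.0:
            continue
        new_posterior = posterior * like / p_outcome
        gain += p_outcome * (entropy(posterior) - entropy(new_posterior))
    return gain

# Deliberately pick the policy the algorithm expects to learn the most from.
chosen = max(range(2), key=expected_info_gain)
print(chosen, [round(expected_info_gain(a), 3) for a in range(2)])  # arm 0 is the informative one
```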
I also don’t know if the human example fully serves the point you’re trying to make, because the very dumb and simple human RL algorithm does in fact sometimes win out against the human.
Not sure where the disagreement is here, if there is one. I agree that there are lots of sophisticated RL setups possible, including ones that bias the agent towards caring about certain kinds of things (like addictive sweets or wireheading). I also agree that even a basic RL algorithm like TD learning may be enough to steer the cognition of complex, highly capable agents in some cases. What is it that you are saying depends on the RL algorithm being weak?
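“Basic” here means something on the order of a tabular TD(0) update; a minimal sketch (mine, not anything specific from the thread):

```python
# Minimal tabular TD(0) step: nudge V[s] toward the bootstrapped target.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update on a dict/array of state values V."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```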
if the simplest policy is one that cares about memory registers, then you will probably just end up with that.
I am generally very skeptical about simplicity arguments. I think that that sort of reasoning generally does not work for making specific predictions about how neural networks will behave, particularly in the context of RL.
Yep. This was one of the conclusions I took away from Reward is not the optimization target.