Thanks for the detailed comment. Overall, it seems to me like my points stand, although I think a few of them are somewhat different than you seem to have interpreted.
policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent
I think I believe the first claim, which I understand to mean “early-/mid-training AGI policies consist of contextually activated heuristics of varying sophistication, instead of e.g. a globally activated line of reasoning about a crisp inner objective.” But that wasn’t actually a point I was trying to make in this post.
in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
Depends. This describes vanilla PG but not DQN. I think there are lots of complications which throw serious wrenches into the “and then SGD hits a ‘global reward optimum’” picture. I’m going to have a post explaining this in more detail, but I will say some abstract words right now in case it shakes something loose / clarifies my thoughts.
Critic-based approaches like DQN have a highly nonstationary loss landscape. The TD-error loss landscape depends on the replay buffer; the replay buffer depends on the policy (in ϵ-greedy exploration, the greedy action depends on the Q-network); the policy depends on past updates; the past updates depend on past replay buffers… This high nonstationarity in the loss landscape basically makes gradient hacking easy in RL (and e.g. vanilla PG seems to confront similar issues, even though it’s directly climbing the reward landscape). For one, the DQN agent just isn’t updating off of experiences it hasn’t had.
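To make that dependency chain concrete, here’s a minimal sketch (toy environment and hyperparameters of my own choosing, no target network) of how the TD loss a DQN agent descends is itself a function of the replay buffer, which is filled by the current Q-network’s ϵ-greedy behavior:

```python
# Minimal DQN-style loop on a toy chain MDP, illustrating the feedback loop:
# Q-network -> epsilon-greedy behavior -> replay buffer -> TD loss -> updates -> Q-network.
import random
import torch
import torch.nn as nn

def step(state, action):
    """Toy 1-D chain: states 0..4, action 1 moves right, reward only at state 4."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    return next_state, (1.0 if next_state == 4 else 0.0)

q_net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-2)
replay_buffer, state, eps, gamma = [], 0, 0.1, 0.9

for t in range(1000):
    # The data distribution is generated by the *current* Q-network (epsilon-greedy).
    with torch.no_grad():
        q_values = q_net(torch.tensor([[float(state)]]))
    action = random.randrange(2) if random.random() < eps else int(q_values.argmax())
    next_state, reward = step(state, action)
    replay_buffer.append((state, action, reward, next_state))
    state = 0 if next_state == 4 else next_state

    # The TD loss is defined over whatever ended up in the buffer; transitions the
    # agent never generated simply never show up here.
    s, a, r, s2 = random.choice(replay_buffer)
    q_sa = q_net(torch.tensor([[float(s)]]))[0, a]
    with torch.no_grad():
        target = r + gamma * q_net(torch.tensor([[float(s2)]])).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Every quantity in that loss is downstream of earlier parameter values, which is the nonstationarity being pointed at.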
As a situation sufficient to illustrate this kind of problem, consider a smart, reflective agent whose computations have historically been reinforced when it attained a raspberry (reward 1). Now, in a new task, the agent has to navigate a maze to get the 100-reward blueberry. Will the agent be forced to get the blueberry?
Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future. Effectively, this means that the agent’s “gradient”/expected-update in the reward landscape is zero along dimensions which would increase the probability it gets blueberries.
So it’s not just a matter of SGD being suboptimal given a fixed data distribution. If the agent doesn’t have an extremely strong “forced to try all actions forever” guarantee (which it won’t, because it’s embedded and can modify its own learning process), the reward landscape is full of stable attractors in which the agent explores none of the behaviors whose updates would push it towards becoming a wireheader, and therefore its expected-update is zero along those dimensions. More extremely, the inner agent can just stop itself from being updated in certain ways (in order to prevent value drift towards reward-optimization); this intervention is instrumentally convergent.
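Here’s a toy version of the “expected-update is zero along those dimensions” claim (my own construction, not from the post): a two-action bandit where action 0 is the raspberry (reward 1) and action 1 is entering the maze (reward 100). If the policy has already zeroed out its probability of entering the maze, on-policy samples never contain the blueberry, so the empirical policy gradient never develops a component pushing toward it:

```python
# On-policy REINFORCE on a 2-armed bandit. Arm 1 (the "blueberry") pays 100, but the
# policy assigns it ~zero probability, so it is never sampled and the updates only
# ever reinforce arm 0 (the "raspberry"); a stable attractor.
import torch

logits = torch.tensor([5.0, -10.0], requires_grad=True)  # exploration of the maze already zeroed out
rewards = torch.tensor([1.0, 100.0])
optimizer = torch.optim.SGD([logits], lr=0.1)

for _ in range(1000):
    probs = torch.softmax(logits, dim=0)
    action = torch.multinomial(probs, 1).item()          # on-policy sampling
    loss = -torch.log(probs[action]) * rewards[action]   # REINFORCE objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(torch.softmax(logits, dim=0))  # arm 1's probability stays ~0; the "global optimum" is never found
```

(Of course, an agent forced to explore ϵ-greedily would eventually stumble onto arm 1 and get strongly updated toward it, which is exactly why zeroing out that exploration is the stable move from the agent’s perspective.)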
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
I did leave a footnote:
Of course, credit assignment doesn’t just reshuffle existing thoughts. For example, SGD raises image classifiers out of the noise of the randomly initialized parameters. But the refinements are local in parameter-space, and dependent on the existing weights through which the forward pass flowed.
However, I think your comment deserves a more substantial response. I actually think that, given just the content in the post, you might wonder why I believe SGD can train anything at all, since there is only noise at the beginning.[1]
Here’s one shot at a response. Consider an online RL setup. The gradient locally changes the computations so as to reduce loss or increase the probability of taking a given action in a given state, and this process is triggered by reward. The resulting updates should most naturally hinge on modeling the parts of the world the agent was interacting with, observing, or representing in its hidden state while making the decision, and not necessarily on modeling the register in some computer somewhere which happens to e.g. correlate perfectly with the triggering of credit assignment.
For example, in the batched update regime, when an agent gets reinforced for completing a maze by moving right, the batch update will upweight decision-making which outputs “right” when the exit is to the right, but which doesn’t output “right” when there’s a wall to the right. This computation must somehow distinguish between exits and walls in the relevant situations. Therefore, I expect such an agent to compute features about the topology of the maze. However, the same argument does not go through for developing decision-relevant features computing the value of the antecedent-computation-reinforcer register.
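As a sketch of what that update looks like mechanically (my own toy code, with a made-up two-feature observation encoding): the reward enters the policy-gradient loss only as a scalar weight, so the gradient flows through the features computed from the observation, not through whatever process produced the reward number:

```python
# One policy-gradient step for "moved right and got reinforced". The gradient is a
# function of the observation features (exit vs. wall) and the network weights; the
# reward is a bare Python float with no gradient path into how it was computed.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))  # obs -> logits over {left, right}

obs = torch.tensor([[1.0, 0.0]])     # hypothetical encoding: exit to the right, no wall
action = 1                           # the agent moved right and reached the exit
reinforcement = 1.0                  # value read off the "register"; just a number here

log_probs = torch.log_softmax(policy(obs), dim=-1)
loss = -reinforcement * log_probs[0, action]
loss.backward()

# policy[0].weight.grad is nonzero and depends on `obs`: the update upweights
# "output right when the exit is to the right", and a wall-to-the-right observation
# would produce a different update. Nothing here pressures the network to model
# the register that produced `reinforcement`.
```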
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
I don’t know what you mean by a “perfect” reward signal, why that has something to do with exploration difficulty, or why no exploration would be needed for my arguments to go through. I do think that if we assume the agent is forced to wirehead, it will become a wireheader; in that sense, my claim is indeed mostly focused on exploration and gradient hacking.
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making.
Not claiming that people are pure RL. Let’s wait until future posts to discuss.
(I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.)
Seems unrelated to me; considerable complexity in human behavior does not imply considerable complexity in the learning algorithm; GPT-3 is far more complex than its training process.
I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining
The point is that the argument “We’re selecting for agents on reward → we get an agent which optimizes reward” is locally invalid. “We select for agents on X → we get an agent which optimizes X” is not true for the case of evolution (which didn’t find inclusive-genetic-fitness optimizers), so it is not true in general, so the implication doesn’t necessarily hold in the AI reward-selection case. Even if RL did happen to train reward optimizers and this post were wrong, the selection argument is too weak on its own to establish that conclusion.
When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B.
This is not mechanistic, as I use the word. I understand “mechanistic” to mean something like “Explaining the causal chain by which an event happens”, not just “Explaining why an event should happen.” However, it is an argument for the latter, and possibly a good one. But the supervised case seems way different than the RL case.
The GPT-3 example is somewhat different. Supervised learning provides exact gradients towards the desired output, unlike RL. However, I think you could have equally complained “I don’t see why you think RL policies ever learn anything”, which would make an analogous point.
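For what it’s worth, the difference can be seen in a toy contrast (my illustration, nothing deeper): the supervised loss knows the desired output, so the gradient points directly at producing it, while a score-function RL estimator only reweights outputs the policy actually sampled:

```python
import torch
import torch.nn.functional as F

logits = torch.zeros(3, requires_grad=True)

# Supervised: exact gradient toward the labeled output, whether or not it was ever sampled.
supervised_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([2]))
supervised_loss.backward()
print(logits.grad)   # negative on index 2 (pushes probability toward the label), positive elsewhere

# RL (REINFORCE-style): the update only involves the action that was actually sampled,
# weighted by whatever reward came back.
logits.grad = None
probs = torch.softmax(logits, dim=0)
sampled = torch.multinomial(probs, 1).item()
reward = 1.0                                   # placeholder environment return
rl_loss = -reward * torch.log(probs[sampled])
rl_loss.backward()
print(logits.grad)   # depends on which action happened to be sampled and its reward
```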
Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
This is not purely speculation, in the sense that you can run EfficientZero in scenarios like this, and I bet it goes for the blueberry.
Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.
Humans often intentionally avoid some high reward ‘blueberry’ analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long term planning instead optimizes more for empowerment—which is usually in conflict with short term hedonic reward.
Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors: vice → more propensity to do the vice next time → vice, in a vicious cycle. They noticed this well before they noticed the reward circuitry causing the internal reinforcement events. If you’re predicting future observations via e.g. SSL, I think it becomes important to (at least crudely) model the effects of value drift during training.
I’m not saying the AI won’t care about reward at all. I think reward will be a secondary value, but that was beside my point here. In this quote, I was arguing that the AI would be quite able to avoid a “vice” (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the “global maximum” picture to get a wrench thrown in it.
When, quantitatively, should that happen (the agent stepping around the planning process)? I’m not sure.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
I think I have some idea what TurnTrout might’ve had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. This—a learned value function—is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn’t have known to code into the reward calculator (e.g. searching in a part of the chess move tree that “looks promising” according to its value function, even if its model does not predict that a checkmate reward is close at hand).
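A rough sketch of that design (simplified Python of my own, not EfficientZero’s actual algorithm): planning rolls the learned model forward, but leaves are scored by the learned value function, so whatever that value head has learned is what steers the search, even where the model predicts no imminent reward:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Model:
    next_state: Callable   # (state, action) -> predicted next state
    reward: Callable       # (state, action) -> predicted immediate reward

def plan(state, actions: List, model: Model, value_fn: Callable,
         depth: int, gamma: float = 0.99) -> Tuple[float, object]:
    """Depth-limited lookahead; leaf evaluation uses the *learned* value_fn."""
    if depth == 0:
        return value_fn(state), None
    best_value, best_action = float("-inf"), None
    for a in actions:
        s2 = model.next_state(state, a)
        v2, _ = plan(s2, actions, model, value_fn, depth - 1, gamma)
        total = model.reward(state, a) + gamma * v2
        if total > best_value:
            best_value, best_action = total, a
    return best_value, best_action
```

If value_fn is misaligned with the reward calculator, the chosen action follows value_fn; whether that misalignment actually arises in practice is the disagreement in the replies that follow.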
Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the ‘model free’ value function.
EfficientZero uses all of that, and like I said, it does not exhibit this failure mode: it will get the blueberry. If the model’s planning can predict a high gradient update for the blueberry, then it has already implicitly predicted a high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high-utility path leading to the blueberry.
Nor does the meta-prediction about avoiding gradients carry through. If it did, then EZ wouldn’t work at all, because every time it finds a new high-utility plan is the equivalent of the blueberry situation.
Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs, or occurs with any specific frequency. (There are examples from humans, such as OCD habits, which seem like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans, let alone all agents.)