Not if exploration is on-policy, or if the agent reflectively models and affects its training process. In either case, the agent can zero out its exploration probability of the maze, so as to avoid predictable value drift towards blueberries. The agent would correctly model that if it attained the blueberry, that experience would enter its data distribution and the agent would be updated so as to navigate towards blueberries instead of raspberries, which leads to fewer raspberries, which means the agent doesn’t navigate to that future.
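To make the quoted claim concrete, here is a minimal sketch, assuming a two-action bandit stands in for the maze. The action names, reward values, learning rate, and the `run` helper are all hypothetical, and the tabular value update is a stand-in rather than whatever training setup the quote has in mind. It only illustrates the narrow point that when updates come solely from actions the agent actually takes, an agent that never samples the blueberry never drifts towards it.

```python
import random

# Hypothetical rewards: the blueberry pays far more than the raspberry.
REWARD = {"raspberry": 1.0, "blueberry": 10.0}

def run(epsilon, steps=500, seed=0, lr=0.5):
    """Tabular value learning where only the action actually taken gets updated."""
    rng = random.Random(seed)
    # The agent starts out preferring raspberries.
    q = {"raspberry": 1.0, "blueberry": 0.0}
    visits = {"raspberry": 0, "blueberry": 0}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(sorted(q))   # explore uniformly
        else:
            action = max(q, key=q.get)       # act on current values
        visits[action] += 1
        # Update only from experience the agent itself generated.
        q[action] += lr * (REWARD[action] - q[action])
    return q, visits

q_no_explore, visits_no_explore = run(epsilon=0.0)
q_explore, visits_explore = run(epsilon=0.1)
print(q_no_explore, visits_no_explore)
print(q_explore, visits_explore)
```

With `epsilon=0.0` the blueberry estimate never moves, because no blueberry experience ever enters the data. With any nonzero exploration, a single sample is enough to flip the preference in this toy.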
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
This is not pure speculation: you can run EfficientZero in scenarios like this, and I bet it goes for the blueberry.
Your mental model seems to assume pure model-free RL trained to the point that it gains some specific model-based predictive planning capabilities without using those same capabilities to get greater reward.
Humans often intentionally avoid some high-reward ‘blueberry’ analogs like drugs using something like the process you describe here, but hedonic reward is only one component of the human utility function, and our long-term planning instead optimizes more for empowerment, which is usually in conflict with short-term hedonic reward.
Long before they knew about reward circuitry, humans noticed that e.g. vices are behavioral attractors, with vice → more propensity to do the vice next time → vice, in a vicious cycle. They noticed this long before they noticed that they had reward circuitry causing the internal reinforcement events. If you’re predicting future observations via e.g. SSL, I think it becomes important to (at least crudely) model the effects of value drift during training.
I’m not saying the AI won’t care about reward at all. I think it’ll be a secondary value, but that was sideways of my point here. In this quote, I was arguing that the AI would be quite able to avoid a “vice” (the blueberry) by modeling the value drift on some level. I was showing a sufficient condition for the “global maximum” picture getting a wrench thrown in it.
When, quantitatively, should that happen, where the agent steps around the planning process? Not sure.
If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations.
I think I have some idea what TurnTrout might’ve had in mind here. Like us, this reflective agent can predict the future effects of its actions using its predictive model, but its behavior is still steered by a learned value function, and that value function will by default be misaligned with the reward calculator/reward predictor. A learned value function is a sensible design for a model-based agent because we want the agent to make foresighted decisions that generalize to conditions we couldn’t have known to code into the reward calculator (e.g. searching in a part of the chess move tree that “looks promising” according to its value function, even if its model does not predict that a checkmate reward is close at hand).
Any efficient model-based agent will use learned value functions, so in practice the difference between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that can help better train the ‘model free’ value function.
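As a rough illustration of rollouts training the value function, here is a minimal Dyna-style sketch. Everything in it is an assumption made for illustration: the five-state corridor, the `model` function standing in for a learned dynamics/reward model, and the constants. It is not EfficientZero’s training loop. The value table is trained only on imagined transitions from the model, never on real experience.

```python
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.5   # TD learning rate (assumed)
GOAL = 4      # terminal state at the end of a 5-state corridor

def model(state):
    """Hypothetical learned model: predicts (next_state, reward) for 'move right'."""
    nxt = state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

def train_value_from_rollouts(n_rollouts=200):
    """Train a tabular value function purely on imagined rollouts."""
    value = [0.0] * (GOAL + 1)   # value[GOAL] stays 0 because it is terminal
    for _ in range(n_rollouts):
        state = 0
        while state != GOAL:
            nxt, reward = model(state)               # imagined transition
            target = reward + GAMMA * value[nxt]     # one-step TD target
            value[state] += ALPHA * (target - value[state])
            state = nxt
    return value

value = train_value_from_rollouts()
print([round(v, 3) for v in value])
```

The predicted reward at the end of the corridor propagates backwards through the value table, so the cached “model-free” estimate at the start state ends up reflecting what the model predicts several steps ahead.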
EfficientZero uses all that, and like I said, it does not exhibit this failure mode: it will get the blueberry. If the model planning can predict a high gradient update for the blueberry, then it has already implicitly predicted a high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high-utility path leading to the blueberry.
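A toy version of that argument, as a sketch under stated assumptions: the maze, the predicted rewards, the stale value table, and the `lookahead`/`plan` helpers are all hypothetical, and the exhaustive depth-limited search below is a stand-in for, not a reproduction of, EfficientZero’s tree search. It shows that once the search horizon reaches a reward the model predicts, the plan follows that reward even though the cached value function has never been updated on it.

```python
GAMMA = 0.95  # discount factor (assumed)

# Hypothetical deterministic model: state -> {action: (next_state, predicted_reward)}
MODEL = {
    "start": {"left": ("raspberry", 1.0), "right": ("hall", 0.0)},
    "hall": {"forward": ("blueberry", 10.0)},
    "raspberry": {},   # terminal
    "blueberry": {},   # terminal
}

# Stale learned values: never trained on the blueberry, so "hall" looks worthless.
STALE_VALUE = {"start": 1.0, "hall": 0.0, "raspberry": 0.0, "blueberry": 0.0}

def lookahead(state, depth):
    """Best discounted return from `state`, bootstrapping from STALE_VALUE at the horizon."""
    actions = MODEL[state]
    if not actions:
        return 0.0                      # terminal states have no future return
    if depth == 0:
        return STALE_VALUE[state]       # search horizon reached: use the cached value
    return max(r + GAMMA * lookahead(nxt, depth - 1) for nxt, r in actions.values())

def plan(state, depth):
    """Pick the action whose model-predicted return is highest within `depth` steps."""
    actions = MODEL[state]
    return max(
        actions,
        key=lambda a: actions[a][1] + GAMMA * lookahead(actions[a][0], depth - 1),
    )

print(plan("start", depth=1))
print(plan("start", depth=2))
```

With a one-step horizon the planner leans on the stale value table and heads for the raspberry; with a two-step horizon the model’s predicted blueberry reward is inside the search and the plan switches. Which of these regimes a real agent is in is exactly what the two sides of this thread disagree about.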
Nor does the meta-prediction about avoiding gradients carry through. If it did, then EZ wouldn’t work at all, because every time it finds a new high-utility plan, it is in the equivalent of the blueberry situation.
Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs, or occurs with any specific frequency. (There are examples from humans, such as OCD habits, which seem like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans, let alone all agents.)