Any efficient model-based agent will use learned value functions, so in practice the distinction between model-based and model-free blurs for efficient designs. The model-based planning generates rollouts that help train the ‘model-free’ value function.
EfficientZero uses all of that, and as I said, it does not exhibit this failure mode: it will get the blueberry. If the model’s planning can predict a large gradient update for the blueberry, then it has already implicitly predicted a high utility for the blueberry, and EZ’s update step would then correctly propagate that and choose the high-utility path leading to the blueberry.
Nor does the meta-prediction about avoiding gradients carry through. If it did, then EZ wouldn’t work at all, because every time it finds a new high-utility plan it is in the equivalent of the blueberry situation.
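To make the propagation point concrete, here is a toy sketch. It is not EfficientZero: there is no MCTS and no learned model, just an exhaustive depth-limited lookahead over a hand-written transition table. The names `MODEL`, `plan`, and `update_values` and all the reward numbers are made up for illustration. The value function starts out knowing only the habitual path; the planner predicts the blueberry payoff, and the planning targets then pull the value function toward it.

```python
# Toy deterministic world (hypothetical): state -> {action: (next_state, reward)}
MODEL = {
    "start": {"habit": ("snack", 1.0), "explore": ("bush", 0.0)},
    "bush": {"pick": ("blueberry", 10.0)},
    "snack": {},        # terminal
    "blueberry": {},    # terminal
}
GAMMA = 0.9


def plan(state, value, depth):
    """Depth-limited lookahead; bootstraps from `value` at leaves and terminals.

    Returns (best_return, best_action). best_action is None at a leaf.
    """
    actions = MODEL[state]
    if depth == 0 or not actions:
        return value.get(state, 0.0), None
    best_return, best_action = float("-inf"), None
    for action, (next_state, reward) in actions.items():
        ret = reward + GAMMA * plan(next_state, value, depth - 1)[0]
        if ret > best_return:
            best_return, best_action = ret, action
    return best_return, best_action


def update_values(value, depth, lr=1.0):
    """Move each non-terminal state's value toward its planning target."""
    for state, actions in MODEL.items():
        if actions:
            target, _ = plan(state, value, depth)
            old = value.get(state, 0.0)
            value[state] = old + lr * (target - old)


# Value function trained only on the habitual path; "bush" was never visited.
value = {"start": 1.0, "snack": 0.0}

print("shallow choice before update:", plan("start", value, depth=1)[1])
print("deeper plan:", plan("start", value, depth=2)[1])
update_values(value, depth=2)
print("shallow choice after update:", plan("start", value, depth=1)[1])
```

In this sketch, predicting that the blueberry branch would produce a large update and predicting that it has high utility are the same computation: the planning target for `start` is large exactly because the planner sees the payoff, and the update then makes even the shallow policy prefer that branch.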
Just because the value function can become misaligned with the utility function in theory does not imply that such misalignment always occurs, or occurs with any specific frequency. (There are examples from humans, such as OCD habits, which seem like an overtrained and stuck value function, but that isn’t a universal failure mode for all humans, let alone all agents.)