Perhaps your outer RL algorithm is getting very sparse rewards, and so does not learn very fast. The inner RL could implement its own reward function, which gives faster feedback and therefore accelerates learning. This is closer to the story in Evan’s mesa-optimization post, just replacing search with RL.
Perhaps more likely (based on my understanding), the outer RL algorithm has a learning rate that is too low, or that is not sufficiently adaptive to the situation. The inner RL algorithm adjusts its own learning rate to improve performance.
I lean towards a more general version of the latter view, in which gradient updates just aren’t a very effective way to track within-episode information.
The central example of learning-to-learn is a policy that explores and exploits effectively when presented with an unknown bandit from within the training distribution. An optimal policy essentially needs to keep track of sufficient statistics of the reward distribution for each action. If you’re training a memoryless policy for a fixed bandit problem using RL, then the only way you have of tracking those sufficient statistics is through your weights, which change only via gradient updates. But the weight-space might not be arranged in a way that’s easily traversed by local jumps. On the other hand, a meta-trained recurrent agent can track sufficient statistics in its activations, traversing the sufficient-statistic space in whatever way it pleases: its updates need not be local.
This has an interesting connection to MAML, because a converged memoryless MAML solution on a distribution of bandit tasks will presumably arrange the part of its weight-space that encodes bandit sufficient statistics in a way that makes it easy to traverse via SGD. That would be a neat (and not difficult) experiment to run.
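To make the weights-versus-activations contrast on bandits a bit more concrete, here is a minimal sketch, not a meta-RL implementation. It assumes a Bernoulli bandit, where per-arm win/loss counts are sufficient statistics. The explicit counts stand in for what a recurrent agent could hold in its activations, and I use Thompson sampling as a hand-written stand-in for a meta-learned exploration rule (that choice is my assumption, as are all the function names and hyperparameters). The second function is a memoryless softmax policy whose only "memory" is its logits, moved by small REINFORCE steps.

```python
import numpy as np

def pull(rng, probs, arm):
    """Sample a Bernoulli reward from the chosen arm."""
    return float(rng.random() < probs[arm])

def run_state_tracker(probs, steps, seed=0):
    """Within-episode knowledge lives entirely in explicit (wins, losses) counts.

    Assumption: Thompson sampling stands in for whatever rule a
    meta-trained recurrent agent would implement over its activations.
    """
    rng = np.random.default_rng(seed)
    k = len(probs)
    wins, losses, total = np.zeros(k), np.zeros(k), 0.0
    for _ in range(steps):
        # Sample from each arm's Beta posterior and act greedily on the samples.
        arm = int(np.argmax(rng.beta(wins + 1.0, losses + 1.0)))
        r = pull(rng, probs, arm)
        wins[arm] += r            # the update to the sufficient statistics is exact,
        losses[arm] += 1.0 - r    # not a small local step in weight-space
        total += r
    return wins, losses, total

def run_reinforce(probs, steps, lr=0.1, seed=0):
    """Memoryless softmax policy: everything learned must be encoded in the logits."""
    rng = np.random.default_rng(seed)
    k = len(probs)
    logits, baseline, total = np.zeros(k), 0.0, 0.0
    for t in range(steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        arm = int(rng.choice(k, p=p))
        r = pull(rng, probs, arm)
        grad = -p
        grad[arm] += 1.0                       # d log pi(arm) / d logits
        logits += lr * (r - baseline) * grad   # one local gradient step
        baseline += (r - baseline) / (t + 1)   # running-mean reward baseline
        total += r
    return logits, total

# Hypothetical arm probabilities, chosen only for illustration.
probs = np.array([0.2, 0.5, 0.8])
wins, losses, ts_total = run_state_tracker(probs, steps=500)
logits, pg_total = run_reinforce(probs, steps=500)
```

The point of the comparison is structural rather than a benchmark: the first agent jumps directly to the correct posterior after every pull, while the second can only nudge its logits by `lr`-sized steps, so how quickly it tracks the same information depends on how the weight-space happens to be laid out.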