On the other hand, OpenAI Five (AN #13) also has many, many subtasks that in theory should interfere with each other, and it still seems to train well.
True, but OA5 is inherently a different setup than ALE. Catastrophic forgetting is at least partially offset by the play against historical checkpoints, which doesn't have an equivalent in your standard ALE; the replay buffer typically turns over so old experiences disappear, and there's no adversarial dynamics or AlphaStar-style population of agents which can exploit forgotten areas of state-space. Since Rainbow is an off-policy DQN, I think you could try saving old checkpoints and periodically spending a few episodes running them, adding those experience samples to the replay buffer, but that might not be enough.
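For concreteness, here is a minimal sketch of what that checkpoint-replay idea might look like on top of a generic off-policy training loop. Every name here (agent, env, replay_buffer, act, add) is a placeholder stand-in rather than any particular Rainbow codebase's API, and it assumes the old-style Gym/ALE step interface:

```python
import copy
import random

checkpoints = []  # frozen copies of the policy, saved as training progresses

def maybe_snapshot(agent, step, every=100_000):
    # Freeze a copy of the current agent so its behaviour can be replayed later.
    if step % every == 0:
        checkpoints.append(copy.deepcopy(agent))

def replay_old_checkpoint(env, replay_buffer, n_episodes=2):
    # Run a few episodes with a randomly chosen frozen checkpoint and push the
    # resulting transitions into the replay buffer, so experience from earlier
    # stages of training keeps reappearing even after the buffer has turned over.
    if not checkpoints:
        return
    old_agent = random.choice(checkpoints)
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            action = old_agent.act(obs)  # act with the old, frozen policy
            next_obs, reward, done, info = env.step(action)
            replay_buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs
```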
There’s also the batch size. The OA5 batch size was ridiculously large. Given all of the stochasticity in a DoTA2 game & additional exploration, that covers an awful lot of possible trajectories.
True, but OA5 is inherently a different setup than ALE.
I broadly agree with this, but have some nitpicks on the specific things you mentioned.
Catastrophic forgetting is at least partially offset by the play against historical checkpoints, which doesn’t have an equivalent in your standard ALE
But since you’re always starting from the same state, you always have to solve the earlier subtasks? E.g. in Montezuma’s revenge in every trajectory you have to successfully get the key and climb the ladder; this doesn’t change as you learn more.
there's no adversarial dynamics or AlphaStar-style population of agents which can exploit forgotten areas of state-space
The thing about Montezuma's Revenge and similar hard exploration tasks is that there's only one trajectory you need to learn, and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.
There’s also the batch size. The OA5 batch size was ridiculously large. Given all of the stochasticity in a DoTA2 game & additional exploration, that covers an awful lot of possible trajectories.
Agreed, but the Memento observation also shows that the problem isn't about exploration: if you make a literal copy of the agent that gets 6600 reward and train that copy from the 6600-reward states, it reliably gets more reward than the original agent got. The only difference between the two situations is that in the original situation, the original agent still had to remember how to get to the 6600-reward states in order to maintain its performance, while the new agent was allowed to start directly from those states and so was allowed to forget how to get to the 6600-reward states.
In particular, I would guess that the original agent does explore trajectories in which it gets higher reward (because the Memento agent definitely does), but for whatever reason it is unable to learn as effectively from those trajectories.
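To make that Memento-style comparison concrete, here is a rough sketch of the setup as described above (not the paper's actual code); agent, env, clone_state, restore_state, and train_one_episode are all hypothetical stand-ins for whatever ALE/agent interface you're using:

```python
import copy
import random

def collect_plateau_states(agent, env, n_episodes=100, target_return=6600):
    # Roll out the converged agent and snapshot the emulator at the point where
    # it has accumulated its plateau return (~6600 in Montezuma's Revenge).
    start_states = []
    for _ in range(n_episodes):
        obs, done, ret = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)
            obs, reward, done, info = env.step(action)
            ret += reward
            if ret >= target_return:
                start_states.append(env.clone_state())  # assumes emulator state can be saved
                break
    return start_states

def train_memento_agent(original_agent, env, start_states, n_episodes):
    # A literal copy of the original network, except that its episodes begin
    # from the saved plateau states: it never has to remember (or re-learn)
    # how to reach 6600 reward, only what to do afterwards.
    memento_agent = copy.deepcopy(original_agent)
    for _ in range(n_episodes):
        env.reset()
        env.restore_state(random.choice(start_states))
        memento_agent.train_one_episode(env)
    return memento_agent
```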
In Gmail, everything after
They also present some fine-grained experiments which show that for a typical agent, training on specific contexts adversely affects performance on other contexts that are qualitatively different.
is cut off by default due to length.
Thanks, we noticed this after we sent it out (I think it didn’t happen in our test emails for whatever reason). Hopefully the kinks in the new design will be worked out by next week.
(That being said, I’ve seen other newsletters which are always cut off by Gmail, so it may not be possible to avoid this when using a nice HTML design… if anyone knows how to fix this, I’d appreciate tips.)
The thing about Montezuma's Revenge and similar hard exploration tasks is that there's only one trajectory you need to learn, and if you forget any part of it you fail drastically; I would by default expect this to be better than adversarial dynamics / populations at ensuring that the agent doesn't forget things.
But is it easier to remember things if there’s more than one way to do them?
Unclear, seems like it could go either way. If you aren’t forced to learn all the ways of doing the task, then you should expect the neural net to learn only one of the ways. So maybe it’s that the adversarial nature of OpenAI Five forced it to learn all the ways, and it was then paradoxically easier to remember all of the ways than just one of the ways.
Intuitively, if you forget how to do something one way, but you remember how to do it other ways, then that could make figuring out the forgotten way again easier, though I don’t have a reason to suspect that would be the case for NNs/etc., and it might depend on the specifics of the task.