If this hypothesis were true, there would be an easy way to improve performance: once you have learned to perform the first subtask, just create a brand new neural net for the next subtask, so that training for this next subtask doesn’t interfere with past learning. Since the new agent has no information about what happened in the past, and must just “pick up” from wherever the previous agent left off, it is called the Memento agent (a reference to the movie of the same name). One can then solve the entire task by executing each agent in sequence.
This leaves unclear how it is decided that old “agents” should be used.
The paper says:
Crucially, non-interference between the two learners is identified as the salient difference.
It seems earlier agents are activated only when (what might be called) “their checkpoint” is reached (‘after’ the game starts over). This makes it seem like once an agent is no longer the cutting edge, it is frozen, and might (depending on the environment) be replaceable* by a replay of actions.
*Not that this replacement takes place.
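To make my reading concrete, here is roughly what I am picturing: each frozen agent plays from its own checkpoint until its successor's checkpoint is reached. This is a minimal sketch; `act`, `at_state`, and the checkpoint bookkeeping are hypothetical stand-ins, not anything from the paper.

```python
# Sketch of my reading: once an agent is no longer the cutting edge it is
# frozen, and at run time it only ever plays from its own checkpoint up to the
# next agent's checkpoint. In a deterministic environment this loop could, in
# principle, be replaced by replaying the recorded actions.
# All interfaces here are hypothetical stand-ins, not the paper's code.

def run_agent_chain(env, agents, checkpoints):
    # checkpoints[i] is the state at which control passes from agents[i]
    # to agents[i + 1]; the final agent simply plays until the episode ends.
    obs, done = env.reset(), False
    for i, agent in enumerate(agents):
        next_checkpoint = checkpoints[i] if i + 1 < len(agents) else None
        while not done:
            obs, _, done, _ = env.step(agent.act(obs))   # frozen policy, no learning
            if next_checkpoint is not None and env.at_state(next_checkpoint):
                break                                    # hand off to the next agent
```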
Was this switch to a new agent automatic or done by hand? (Was ‘the agent has plateaued’ determined by a program or the authors of the paper?)
Furthermore, in MONTEZUMA’S REVENGE we find that instead starting the second agent as randomly initialized (the weights encode no past context) performs equivalently.
We have now seen that learning about one portion of PONG or QBERT typically improves the prediction error in other contexts, that is, learning generalizes. However, we now present that this is not generally the case.
Not apparent.
independent reproducibility requires us to be able to reproduce similar results using only what is written in the paper. Crucially, this excludes using the author’s code.
The alternative might be interesting: given the code, but not the paper, see what insights can be found.
I’d be interested to see (and curious if the authors tried) more continuous variants of this where older information is compressed at a higher rate than newer information, since it seems rather arbitrary to split into two FIFO queues where one has a fixed compression rate.
Set a size for the queues; when one is full, a new one is made. (Continuity seems expensive if it means compressing every time something is added.)
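Something like the sketch below, where `compress` is a placeholder for whatever compression function the memory scheme actually uses; this is just the idea, not anything from the paper.

```python
from collections import deque

# Rough sketch of the suggestion: keep fixed-size queues and compress a queue
# exactly once, when it fills up, rather than re-compressing on every add.
# `compress` is a hypothetical placeholder.

class ChunkedMemory:
    def __init__(self, chunk_size, compress):
        self.chunk_size = chunk_size
        self.compress = compress
        self.current = deque()          # newest, uncompressed queue
        self.compressed_chunks = []     # older queues, each compressed once

    def add(self, item):
        self.current.append(item)
        if len(self.current) == self.chunk_size:
            # The full queue is compressed once, then a fresh one is started.
            self.compressed_chunks.append(self.compress(list(self.current)))
            self.current = deque()
```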
This leaves unclear how it is decided that old “agents” should be used.
Yeah, it’s complicated and messy and not that important for the main point of the paper, so I didn’t write about it in the summary.
Was this switch to a new agent automatic or done by hand? (Was ‘the agent has plateaued’ determined by a program or the authors of the paper?)
Automatic / program. See Section 4, whose first sentence is “To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent.”
(The algorithm cheats a bit by assuming that you can run the original agent for some additional time, but then “roll it back” to the first state at which it got the max reward along the trajectory.)
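A rough sketch of that selection step as described above, with hypothetical interfaces (`clone_state` in particular is a stand-in, not the paper's implementation):

```python
# Run the plateaued agent a bit longer, track the return along the trajectory,
# and keep the first state at which the return reaches its maximum. This is my
# reading of the description above, not the paper's actual code.

def select_plateau_state(env, agent, extra_steps):
    obs = env.reset()
    best_state, best_return, ret = None, float("-inf"), 0.0
    for _ in range(extra_steps):
        obs, reward, done, _ = env.step(agent.act(obs))
        ret += reward
        if ret > best_return:                 # strictly greater: first state with the max return
            best_return = ret
            best_state = env.clone_state()    # hypothetical emulator-state snapshot
        if done:
            break
    return best_state
```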
Not apparent.
I may be missing your point, but isn’t the fact that the Memento agent works on Montezuma’s Revenge evidence that learning is not generalizing across “sections” in Montezuma’s Revenge?
I may be missing your point, but isn’t the fact that the Memento agent works on Montezuma’s Revenge evidence that learning is not generalizing across “sections” in Montezuma’s Revenge?
I was indicating that I hadn’t found the answer I sought (but I included those quotes because they seemed interesting, if unrelated).
Automatic / program. See Section 4, whose first sentence is “To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent.”
Thanks for highlighting that. The reason I was interested is that I was thinking of the neural networks as being deployed to complete tasks rather than the entire game by themselves.
I ended up concluding that the game was being divided into ‘parts’ or epochs, each with its own agent deployed in sequence. The “this method makes things easy as long as there’s no interference” point is interesting when compared to multi-agent learning: the agents are on the same team, yet cooperation doesn’t seem to be easy under these circumstances (or at least not an efficient strategy, given computational constraints). It reminded me of my questions about those approaches, like: does freezing one agent for a round so it’s predictable, then training the other one (or having it play with a human), improve things? How can ‘learning to cooperate better’ be balanced with ‘continuing to be able to cooperate/coordinate with the other player’?
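To make the first of those questions concrete, this is the kind of alternating freeze-then-train loop I had in mind; the interfaces are purely hypothetical, just to pin down the question.

```python
# Toy sketch of "freeze one agent for a round so it's predictable, then train
# the other": alternate rounds in which one teammate's parameters are held
# fixed while the other keeps learning alongside it. Hypothetical interfaces.

def alternating_training(env, agent_a, agent_b, rounds, steps_per_round):
    learner, frozen = agent_a, agent_b
    for _ in range(rounds):
        frozen.freeze()                      # held fixed this round, so it stays predictable
        learner.train_with(env, teammate=frozen, steps=steps_per_round)
        learner, frozen = frozen, learner    # swap roles for the next round
```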