I’d be interested in how performance (defined as how often the agent goes to the cheese) in the test environment varies as you vary n in this experiment. For n = 5, this is 69.1%, right?
The original paper investigated this, actually. In the following, the y-axis shows P(gets cheese) × 10 (10 being the reward for getting cheese).
(Note that even for n=15 I was able to find a few videos where the agent doesn’t go to the cheese. I don’t remember exactly where the agent went; I think it was up and right.)
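(For concreteness, here’s a minimal sketch of how one could estimate P(gets cheese) for a single trained agent; `agent`, `make_test_env`, and the `got_cheese` info key are hypothetical stand-ins for whatever policy/environment API is in use:)

```python
import numpy as np

def estimate_cheese_rate(agent, make_test_env, episodes=1000, seed=0):
    """Estimate P(gets cheese) for one trained agent by rolling it out
    in freshly sampled test mazes (cheese anywhere in the 15x15 grid)."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(episodes):
        env = make_test_env(seed=int(rng.integers(2**31)))
        obs, done, info = env.reset(), False, {}
        while not done:
            obs, reward, done, info = env.step(agent.act(obs))
        successes += bool(info.get("got_cheese", False))
    return successes / episodes

# Sweeping this over agents trained with different n gives the curve;
# the paper's y-axis is then estimate * 10 (10 = cheese reward),
# e.g. 0.691 * 10 = 6.91 for the n = 5 agent.
```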
For the 15x15 agent, my prediction is that P(cheese acquired) is above 95%, though as you point out that’s kind of an unfair or at least not very meaningful test of generalization.
Nice prediction!
I’d be very curious what shard theory predicts these look like. When the agent doesn’t find the cheese, is this because it sometimes ends up in the top right and sometimes in the bottom right? Or does it end up in the middle right? Something else? I have no strong prediction here, but I’m curious what your predictions are and whether shard theory has anything to say here.
Interesting question, thanks. My first impulse is: the agent ends up along some path which goes right (either up or down) but which doesn’t lead to the cheese. I don’t know whether I’d expect it to learn to go right in general, or to have both a top-right shard and a bottom-right shard, or something else entirely. I’m leaning towards the first one: conditional on not getting the cheese, the agent ends up on some path that takes it very far right and makes its y-position either high or low.
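(If one wanted to check this empirically, here’s a minimal sketch, assuming final mouse positions from no-cheese episodes have already been collected as (x, y) grid coordinates; the function name and the 25%/75% thresholds are my own choices:)

```python
import numpy as np

def summarize_failures(final_positions, grid_size=15):
    """Bucket final mouse positions from no-cheese episodes to test the
    'far right, y either high or low' hypothesis.

    final_positions: array-like of (x, y) grid coordinates, one per episode.
    """
    pos = np.asarray(final_positions, dtype=float)
    x, y = pos[:, 0], pos[:, 1]
    right = x >= 0.75 * grid_size
    top, bottom = y >= 0.75 * grid_size, y <= 0.25 * grid_size
    print(f"far right overall:   {right.mean():.0%}")
    print(f"far right, top:      {(right & top).mean():.0%}")
    print(f"far right, bottom:   {(right & bottom).mean():.0%}")
    print(f"far right, middle y: {(right & ~top & ~bottom).mean():.0%}")
```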
This brings up something interesting: it seems worthwhile to compare the internals of a ‘misgeneralizing’ small-n agent with those of large-n agents, and to check whether there is a phase transition in how the network operates internally. One way to operationalize that is sketched below.
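(A minimal sketch of one way to make that comparison, using linear centered kernel alignment (CKA) as the representation-similarity measure; CKA is my suggested metric here, not something from the paper, and the activation matrices are assumed to come from running the same batch of maze observations through both networks:)

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices
    of shape (num_inputs, num_features); 1.0 = identical representations.
    Feature dimensions may differ, but the input batch must be shared."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical usage: acts_small[layer] and acts_large[layer] hold each
# network's activations on the same observation batch. Plotting
# linear_cka(acts_small[layer], acts_large[layer]) per layer, across
# agents trained with varying n, would show whether similarity drops
# smoothly or sharply -- a sharp drop being (weak) evidence of a phase
# transition in how the network operates internally.
```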